# EEE3027 Lab 5 - Multithreaded Machine Learning (ML) Processes

In last week's lab session, we showcased the design of the IoT 'Sensor - Processor' Co-Simulation Example and introduced the concepts of concurrency through the use of mutexes and semaphores via multithreading.

For this week's lab session, we extend this example by introducing a single 'Sensor', which we will call 'DataLoader', and two 'Processors', which we will call 'TrainerReader' and 'InfererReader' respectively.

Although this example is relatively simple compared to how the data source is managed in the coursework, the same concepts can in principle be applied to split the data sensibly (and cleverly) between ML training and ML inference. This example therefore aims to teach you these concepts so that you can complete the following tasks:

> 1/ Assuming you have created the class for Data Preprocessing (i.e. Class 1), create **two more separate classes**: **one for ML Training** (i.e. Class 2) that uses _chunks of data between Class 1 and Class 2_, and **another for ML Inference** (i.e. Class 3) that uses a _single datapoint between Class 1 and Class 3_.
>
> 2/ Now, assuming that there is no concurrency between Class 1 and Class 2 as well as between Class 1 and Class 3, find a way to introduce **two new shared memories** (one memory between Class 1 and 2, and one memory between Class 1 and 3) with _limited capacity_ like in the sensor-processor co-simulation example. For this, you can assume that **Class 1 is a producer** and that **Classes 2 and 3 are consumers**.

## Block Diagram of Extended Example
The block diagram below shows our extended example, involving the data loader, trainer and inferer. For this example, we assume that the data has already been preprocessed.

> **_Note:_** In this example, the 'Memory/Buffer' module from the IoT 'Sensor-Processor' Co-Simulation Example has been renamed 'Data source'; the two are nearly identical and serve the same purpose.

<center><img src='loader-trainer-inference-block-diagram.png'/></center>

Typically, in ML, these stages are run in the following order:

1) Loading (and pre-processing) the data.
2) Training the ML model using this data.
3) Running inference with the trained ML model using this data.

One interesting (and potentially faster) approach is to run the training and inference of the ML model as **concurrent processes**.

> **_Note:_** Parallelism is a form of concurrency, but concurrency is not necessarily parallelism. Concurrency does not require every task of a system to execute at the same instant; rather, it is the **ability to make progress on multiple tasks during overlapping periods of time** (i.e. *one or more* tasks in progress at once). In contrast, **sequential** execution means *one task at a time*, and **parallelism** usually means *more than one task executing at the same instant*.
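To see the difference between sequential and concurrent execution in practice, the sketch below runs three simulated tasks (the task names and durations are arbitrary assumptions) first one after another and then on separate threads. Because `time.sleep` releases CPython's global interpreter lock (GIL), the sleeping threads overlap, so the concurrent run should take roughly as long as the longest task rather than the sum of all three. Note that in the standard CPython build, CPU-bound Python code running in threads is concurrent but not truly parallel.

```python
import threading
import time

log_lock = threading.Lock()

def simulated_task(name, duration, log):
    # time.sleep stands in for waiting (e.g. on I/O); it releases the GIL,
    # so other threads can run while this one sleeps.
    time.sleep(duration)
    with log_lock:  # protect the shared list
        log.append(name)

def run_sequentially(durations):
    log = []
    start = time.perf_counter()
    for name, duration in durations.items():
        simulated_task(name, duration, log)  # one task at a time
    return time.perf_counter() - start, log

def run_concurrently(durations):
    log = []
    threads = [threading.Thread(target=simulated_task, args=(name, duration, log))
               for name, duration in durations.items()]
    start = time.perf_counter()
    for t in threads:
        t.start()   # all tasks are now in progress at once
    for t in threads:
        t.join()    # wait for every task to finish
    return time.perf_counter() - start, log

durations = {"loader": 0.3, "trainer": 0.5, "inferer": 0.2}  # assumed, in seconds
seq_time, seq_log = run_sequentially(durations)
con_time, con_log = run_concurrently(durations)
print(f"sequential: {seq_time:.2f} s, concurrent: {con_time:.2f} s")
```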

A key question from this is: **_How can we turn these sequential operations into concurrent ones?_**

To understand the procedure needed, let us start from a sequential implementation and slowly modify each stage to transform it into a (sufficiently) concurrent implementation.

## Timing assumptions of our processes

Before we cover the Python code implementations, let us consider the timing of our processes with the following assumptions:

- The data loader process requires _1 second_ to add 1 datapoint to the memory.
- The trainer reader process requires _5 seconds_ to read 3 datapoints from the memory.
- The inferer reader process requires _2 seconds_ to read 1 datapoint from the memory.
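These assumptions already let us estimate what to expect. The sketch below assumes 15 datapoints in total (the `CAPACITY` used in the code later). Running the stages back to back costs the sum of their busy times, whereas an ideal concurrent design can be no faster than its slowest single stage; real concurrent runs will be somewhat slower than this bound, since threads also spend time waiting on each other.

```python
# Timing assumptions from the list above.
CAPACITY = 15                   # total datapoints, as in the code below
LOAD_TIME, LOAD_BATCH = 1, 1    # data loader: 1 s per datapoint
TRAIN_TIME, TRAIN_BATCH = 5, 3  # trainer: 5 s per 3 datapoints
INFER_TIME, INFER_BATCH = 2, 1  # inferer: 2 s per datapoint

def busy_time(total, batch, seconds_per_batch):
    # Number of operations needed to process `total` datapoints.
    n_ops = -(-total // batch)  # ceiling division
    return n_ops * seconds_per_batch

loader = busy_time(CAPACITY, LOAD_BATCH, LOAD_TIME)      # 15 ops * 1 s = 15 s
trainer = busy_time(CAPACITY, TRAIN_BATCH, TRAIN_TIME)   # 5 ops * 5 s = 25 s
inferer = busy_time(CAPACITY, INFER_BATCH, INFER_TIME)   # 15 ops * 2 s = 30 s

sequential_estimate = loader + trainer + inferer        # stages run back to back
concurrent_lower_bound = max(loader, trainer, inferer)  # perfect overlap, no waiting
print(sequential_estimate, concurrent_lower_bound)      # 70 30
```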

> **_Note:_** As with any simulation, the timing values added to these operations are fictional and do not represent the real world, where such timings can vary drastically. The purpose of these arbitrary values is mainly to show the (theoretical) difference that a concurrent design can make compared to a sequential design, and to understand _why_ we need such design choices.
>
> For example: _Does the ML inference really need to wait for the ML training to finish if their input data are **mutually exclusive**_? Technically, we do not need to train the model before running inference with it, but the output of an untrained model will very likely be poor.

## Sequential Python Implementation

The first, and simplest, way to implement the above block diagram in Python is to implement each 'process' sequentially, as below.

As such, the flow of our sequential Python code implementation will be [Data Loader]&#8594;[Trainer]&#8594;[Inferer].

> **_Note:_** Although we do not explicitly create any threads or functions for the data loader, trainer and inferer, the operations below follow exactly the order in which each process would behave. This code should therefore behave (nearly) identically to creating one thread per process and making the trainer wait for the data loader to join and the inferer wait for both the data loader and the trainer to join, or to wrapping each process in a function and calling the functions in the order data loader, trainer, inferer.
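The thread-based equivalence described in the note can be sketched as follows. Each (hypothetical) stage runs in its own thread, but because every thread is joined before the next one is started, the threads gain no concurrency and always execute in the fixed order data loader, trainer, inferer.

```python
import threading

order = []
order_lock = threading.Lock()

def make_stage(name):
    # Hypothetical stage: it only records that it ran.
    def stage():
        with order_lock:
            order.append(name)
    return stage

# Starting a thread and joining it before starting the next one
# forces the stages to run strictly one after another.
for name in ["loader", "trainer", "inferer"]:
    t = threading.Thread(target=make_stage(name), name=name)
    t.start()
    t.join()

print(order)  # ['loader', 'trainer', 'inferer']
```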

In [None]:
import time
import random

random.seed(0)

CAPACITY = 15
DATALOADER_LOAD_THRESHOLD = 1
TRAINER_READ_THRESHOLD = 3
INFERER_READ_THRESHOLD = 1

DATALOADER_WAIT_TIME = 1
TRAINER_WAIT_TIME = 5
INFERER_WAIT_TIME = 2

data = [-1 for i in range(CAPACITY)]

print('[CONSOLE] Operation started.')

main_start_time = time.time()

# Producer 1 operation -- DataLoader
start_time = time.time()
datapoints_loaded = 0
in_index = 0
while datapoints_loaded < CAPACITY:
    datapoint = random.randint(1, 100)
    data[in_index] = datapoint
    print("Loader added datapoint: ", str(datapoint), " to data index: ", str(in_index + 1))
    in_index = (in_index + DATALOADER_LOAD_THRESHOLD) % CAPACITY
    datapoints_loaded += DATALOADER_LOAD_THRESHOLD
    time.sleep(DATALOADER_WAIT_TIME)
end_time = time.time()
print("Loader has taken ", str(end_time - start_time), " seconds to complete")

# Consumer 1 operation -- Trainer, reads TRAINER_READ_THRESHOLD (3) datapoints at a time
start_time = time.time()
read_datapoints_in_trainer = 0
out_index_trainer = 0
while read_datapoints_in_trainer < CAPACITY:
    trainer_data_to_load = []
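    for i in range(TRAINER_READ_THRESHOLD):
        # Wrap around the circular buffer instead of hard-coding 3 indices
        trainer_data_to_load += [data[(out_index_trainer + i) % CAPACITY]]
    print("Trainer read datapoints: ", str(trainer_data_to_load), " from indices: ", str(out_index_trainer + 1), " to ", str((out_index_trainer + TRAINER_READ_THRESHOLD - 1) % CAPACITY + 1))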
    out_index_trainer = (out_index_trainer + TRAINER_READ_THRESHOLD) % CAPACITY
    read_datapoints_in_trainer += TRAINER_READ_THRESHOLD
    time.sleep(TRAINER_WAIT_TIME)
end_time = time.time()
print("Trainer has taken ", str(end_time - start_time), " seconds to complete")

# Consumer 2 operation -- Inference, reads a single datapoint
start_time = time.time()
read_datapoints_in_inferer = 0
out_index_inferer = 0
while read_datapoints_in_inferer < CAPACITY:
    inferer_data_to_load = data[out_index_inferer]
    print("Inferer read datapoint: ", str(inferer_data_to_load), " from data index: ", str(out_index_inferer + 1))
    # Advance the inferer's own read index (previously the trainer's index was updated by mistake)
    out_index_inferer = (out_index_inferer + INFERER_READ_THRESHOLD) % CAPACITY
    read_datapoints_in_inferer += INFERER_READ_THRESHOLD
    time.sleep(INFERER_WAIT_TIME)
end_time = time.time()
print("Inferer has taken ", str(end_time - start_time), " seconds to complete")

main_end_time = time.time()

print('[CONSOLE] Operation completed.')
print('[CONSOLE] Total operation time taken is ', str(main_end_time - main_start_time))

[CONSOLE] Operation started.
Loader added datapoint:  50  to data index:  1
Loader added datapoint:  98  to data index:  2
Loader added datapoint:  54  to data index:  3
Loader added datapoint:  6  to data index:  4
Loader added datapoint:  34  to data index:  5
Loader added datapoint:  66  to data index:  6
Loader added datapoint:  63  to data index:  7
Loader added datapoint:  52  to data index:  8
Loader added datapoint:  39  to data index:  9
Loader added datapoint:  62  to data index:  10
Loader added datapoint:  46  to data index:  11
Loader added datapoint:  75  to data index:  12
Loader added datapoint:  28  to data index:  13
Loader added datapoint:  65  to data index:  14
Loader added datapoint:  18  to data index:  15
Loader has taken  15.113819360733032  seconds to complete
Trainer read datapoints:  [50, 98, 54]  from indices:  1  to  3
Trainer read datapoints:  [6, 34, 66]  from indices:  4  to  6
Trainer read datapoints:  [63, 52, 39]  from indices:  7  to  9
Trainer read

## Initial Concurrent Python Implementation

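Looking at the code and its output, the bottleneck of running these processes sequentially is clear: each stage must wait for the previous one to finish entirely. The loader alone takes about 15 seconds (15 datapoints at 1 second each), and by the same arithmetic the trainer adds about 25 seconds (5 reads of 3 datapoints at 5 seconds each) and the inferer about 30 seconds (15 reads at 2 seconds each), so the whole run should take roughly 70 seconds.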

One way to improve this implementation is to introduce some concurrency, using the concepts we showed last week, i.e. mutexes and semaphores from Python's `threading` library.

With this change, our initial concurrent implementation starts from [Data Loader] and then performs a *choice* of either: 1) [Data Loader] to add more datapoints, 2) [Trainer] if at least 3 datapoints are stored, or 3) [Inferer] if at least 1 datapoint is stored.
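The 'at least 3 datapoints' and 'at least 1 datapoint' conditions map onto counting semaphores. `threading.Semaphore.acquire` takes only one permit per call, which is why the implementation that follows waits on `full1` three times in a loop. The hypothetical helper `acquire_n` (not part of the lab code) sketches a variant that either takes all `n` permits or gives back the ones it took, which makes the condition easy to test in isolation.

```python
import threading

def acquire_n(sem, n, timeout=None):
    """Hypothetical helper: take n permits from `sem`, or none at all if they
    cannot all be obtained (each attempt waits up to `timeout` seconds,
    or forever if timeout is None)."""
    taken = 0
    for _ in range(n):
        if not sem.acquire(timeout=timeout):
            for _ in range(taken):   # roll back the permits already taken
                sem.release()
            return False
        taken += 1
    return True

full = threading.Semaphore(0)   # counts stored datapoints; starts empty

for _ in range(2):              # the loader stores 2 datapoints
    full.release()

trainer_ok = acquire_n(full, 3, timeout=0.1)   # False: needs 3, only 2 stored
inferer_ok = acquire_n(full, 1, timeout=0.1)   # True: needs only 1

full.release()
full.release()                                  # now 1 + 2 = 3 stored
trainer_ok_later = acquire_n(full, 3, timeout=0.1)  # True

print(trainer_ok, inferer_ok, trainer_ok_later)  # False True True
```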

By introducing these concepts, our Python code should now resemble the block diagram below.

<center><img src='loader-trainer-inference-block-diagram-concurrent.png' /></center>

In [None]:
import time
import random
import threading

random.seed(0)

CAPACITY = 15
DATALOADER_LOAD_THRESHOLD = 1
TRAINER_READ_THRESHOLD = 3
INFERER_READ_THRESHOLD = 1

DATALOADER_WAIT_TIME = 1
TRAINER_WAIT_TIME = 5
INFERER_WAIT_TIME = 2

data = [-1 for i in range(CAPACITY)]

mutex = threading.Semaphore()
empty1 = threading.Semaphore(CAPACITY)
empty2 = threading.Semaphore(CAPACITY)
full1 = threading.Semaphore(0)
full2 = threading.Semaphore(0)

# Producer 1 operation -- DataLoader
class DataLoader(threading.Thread):
    def run(self):
        start_time = time.time()
        datapoints_loaded = 0
        in_index = 0
        while datapoints_loaded < CAPACITY:

            time.sleep(DATALOADER_WAIT_TIME)

            empty1.acquire()
            empty2.acquire()
            mutex.acquire()

            datapoint = random.randint(1, 100)
            data[in_index] = datapoint
            print("Loader " , str(f'{self.name=}'), " added datapoint: ", str(datapoint), " to data index: ", str(in_index + 1))
            in_index = (in_index + DATALOADER_LOAD_THRESHOLD) % CAPACITY
            datapoints_loaded += DATALOADER_LOAD_THRESHOLD

            mutex.release()
            full1.release()
            full2.release()
        end_time = time.time()
        print("Loader " , str(f'{self.name=}'), " completed in ", str(end_time - start_time), " seconds")

# Consumer 1 operation -- Trainer, reads TRAINER_READ_THRESHOLD (3) datapoints at a time
class TrainerReader(threading.Thread):
    def run(self):
        start_time = time.time()
        read_datapoints_in_trainer = 0
        out_index_trainer = 0
        while read_datapoints_in_trainer < CAPACITY:

            time.sleep(TRAINER_WAIT_TIME)

            t_counter = TRAINER_READ_THRESHOLD
            while t_counter > 0:
                full1.acquire()
                t_counter -= 1
            mutex.acquire()

            trainer_data_to_load = []
            for i in range(TRAINER_READ_THRESHOLD):
                # Wrap around the circular buffer instead of hard-coding 3 indices
                trainer_data_to_load += [data[(out_index_trainer + i) % CAPACITY]]
            print("Trainer ", str(f'{self.name=}'), "read datapoints: ", str(trainer_data_to_load), " from data indices: ", str(out_index_trainer + 1), " to ", str((out_index_trainer + TRAINER_READ_THRESHOLD - 1) % CAPACITY + 1))
            out_index_trainer = (out_index_trainer + TRAINER_READ_THRESHOLD) % CAPACITY
            read_datapoints_in_trainer += TRAINER_READ_THRESHOLD

            mutex.release()
            for i in range(TRAINER_READ_THRESHOLD):
                empty1.release()
        end_time = time.time()
        print("Trainer " , str(f'{self.name=}'), " completed in ", str(end_time - start_time), " seconds")

# Consumer 2 operation -- Inference, reads a single datapoint
class InfererReader(threading.Thread):
    def run(self):
        start_time = time.time()
        read_datapoints_in_inferer = 0
        out_index_inferer = 0
        while read_datapoints_in_inferer < CAPACITY:

            time.sleep(INFERER_WAIT_TIME)

            i_counter = INFERER_READ_THRESHOLD
            while i_counter > 0:
                full2.acquire()
                i_counter -= 1
            mutex.acquire()

            inferer_data_to_load = data[out_index_inferer]
            print("Inferer", str(f'{self.name=}'), "read datapoint: ", str(inferer_data_to_load), " from data index: ", str(out_index_inferer + 1))
            out_index_inferer = (out_index_inferer + INFERER_READ_THRESHOLD) % CAPACITY
            read_datapoints_in_inferer += INFERER_READ_THRESHOLD

            mutex.release()
            for i in range(INFERER_READ_THRESHOLD):
                empty2.release()
        end_time = time.time()
        print("Inferer " , str(f'{self.name=}'), " completed in ", str(end_time - start_time), " seconds")

print('[CONSOLE] Operation started.')

main_start_time = time.time()

# Creating Threads
data_load = DataLoader()
train_read = TrainerReader()
infer_read = InfererReader()

# Starting Threads
data_load.start()
train_read.start()
infer_read.start()

# Waiting for threads to complete
data_load.join()
train_read.join()
infer_read.join()

main_end_time = time.time()

print('[CONSOLE] Operation completed.')
print('[CONSOLE] Total operation time taken is ', str(main_end_time - main_start_time))

[CONSOLE] Operation started.
Loader  self.name='Thread-3'  added datapoint:  50  to data index:  1
Loader  self.name='Thread-3'  added datapoint:  98  to data index:  2
Inferer self.name='Thread-5' read datapoint:  50  from data index:  1
Loader  self.name='Thread-3'  added datapoint:  54  to data index:  3
Loader  self.name='Thread-3'  added datapoint:  6  to data index:  4
Inferer self.name='Thread-5' read datapoint:  98  from data index:  2
Trainer  self.name='Thread-4' read datapoints:  [50, 98, 54]  from data indices:  1  to  3
Loader  self.name='Thread-3'  added datapoint:  34  to data index:  5
Inferer self.name='Thread-5' read datapoint:  54  from data index:  3
Loader  self.name='Thread-3'  added datapoint:  66  to data index:  6
Loader  self.name='Thread-3'  added datapoint:  63  to data index:  7
Loader  self.name='Thread-3'  added datapoint:  52  to data index:  8
Inferer self.name='Thread-5' read datapoint:  6  from data index:  4
Loader  self.name='Thread-3'  added datapo

## Extended Concurrent Python Implementation

Although we have introduced some concurrency, there is still sequential behaviour within the system that we can make concurrent, namely when [Trainer Reader] and [Inferer Reader] are reading data.

In the current implementation, the single *mutex* semaphore ensures that only one thread can access the shared memory at a time, which creates a sequential dependency between [Trainer Reader] and [Inferer Reader]: while one of them holds the mutex, the other must wait. Moreover, both readers consume the same data from the same memory, so they are tightly coupled; for example, the [Data Loader] cannot reuse a slot until both readers have read it, so the slower reader can hold back the whole pipeline.

One possible way to resolve this is by carefully removing this dependency by introducing an additional memory, such that one memory is shared between only the [Data Loader] and [Trainer Reader] and one memory is shared between only the [Data Loader] and [Inferer Reader].
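The benefit of giving each memory its own mutex can be sketched in isolation. Below, a thread standing in for the trainer holds `mutex1` (the lock for its memory) while the main thread tries both locks without blocking. The sketch uses `threading.Lock`, the usual choice for a mutex; the lab code uses `threading.Semaphore()`, whose default initial value of 1 gives the same behaviour here. The `Event` objects are only there to make the timing deterministic.

```python
import threading

mutex1 = threading.Lock()   # guards the memory shared with the trainer
mutex2 = threading.Lock()   # guards the memory shared with the inferer
trainer_holds_lock = threading.Event()
trainer_may_finish = threading.Event()

def trainer():
    with mutex1:                     # trainer is inside its critical section
        trainer_holds_lock.set()
        trainer_may_finish.wait()    # simulate a long read

t = threading.Thread(target=trainer)
t.start()
trainer_holds_lock.wait()

# The inferer's memory has its own lock, so it is not blocked...
inferer_got_lock = mutex2.acquire(blocking=False)
if inferer_got_lock:
    mutex2.release()
# ...whereas anything that needs the trainer's memory must wait.
loader_got_lock1 = mutex1.acquire(blocking=False)

trainer_may_finish.set()
t.join()
print(inferer_got_lock, loader_got_lock1)  # True False
```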

> **_Note:_** Technically, you could simply remove the mutex, and with it this sequential dependency. However, as discussed in Lab 4's example, this allows the data to be corrupted (due to simultaneous reads/writes) or read at the wrong time, leading to errors and/or stale data.
>
> While the following solution does not eliminate every possible race condition, the intention of this example is to give you an idea of how to split up the work of ML training and ML inference by carefully designing how the data propagates. For example, using two copies of the data allows training and inference to take place independently; how this impacts the training and inference themselves is something to explore in your own time.

With this additional change, our extended concurrent implementation starts from [Data Loader] and then performs two independent *choices*: the first is either 1) [Data Loader] to add more datapoints or 2) [Trainer] if at least 3 datapoints are stored in the first memory, and the second is either 1) [Data Loader] to add more datapoints or 2) [Inferer] if at least 1 datapoint is stored in the second memory.

Again, with this change in our Python code, it should now resemble this block diagram below.

<center><img src='loader-trainer-inference-block-diagram-concurrent-two-memory.png' /></center>

In [None]:
import time
import random
import threading

random.seed(0)

CAPACITY = 15
DATALOADER_LOAD_THRESHOLD = 1
TRAINER_READ_THRESHOLD = 3
INFERER_READ_THRESHOLD = 1

DATALOADER_WAIT_TIME = 1
TRAINER_WAIT_TIME = 5
INFERER_WAIT_TIME = 2

TEST_MEMORIES_WITH_DIFFERENT_DATA = False

data1 = [-1 for i in range(CAPACITY)]
data2 = [-1 for i in range(CAPACITY)]

mutex1 = threading.Semaphore()
mutex2 = threading.Semaphore()
empty1 = threading.Semaphore(CAPACITY)
empty2 = threading.Semaphore(CAPACITY)
full1 = threading.Semaphore(0)
full2 = threading.Semaphore(0)

# Producer 1 operation -- DataLoader
class DataLoader(threading.Thread):
    def run(self):
        start_time = time.time()
        datapoints_loaded = 0
        in_index = 0
        while datapoints_loaded < CAPACITY:

            time.sleep(DATALOADER_WAIT_TIME)

            empty1.acquire()
            empty2.acquire()
            mutex1.acquire()
            mutex2.acquire()

            datapoint = random.randint(1, 100)
            data1[in_index] = datapoint

            if TEST_MEMORIES_WITH_DIFFERENT_DATA:
                datapoint2 = random.randint(1, 100) # Proof of concept
                data2[in_index] = datapoint2
                print("Loader " , str(f'{self.name=}'), " added datapoint: ", str(datapoint), " to data1 index: ", str(in_index + 1),
                      " and added datapoint: ", str(datapoint2), " to data2 index: ", str(in_index+1))
            else:
                data2[in_index] = datapoint
                print("Loader " , str(f'{self.name=}'), " added datapoint: ", str(datapoint), " to data1 and data2 index: ", str(in_index + 1))

            in_index = (in_index + DATALOADER_LOAD_THRESHOLD) % CAPACITY
            datapoints_loaded += DATALOADER_LOAD_THRESHOLD

            mutex1.release()
            mutex2.release()
            full1.release()
            full2.release()
        end_time = time.time()
        print("Loader " , str(f'{self.name=}'), " completed in ", str(end_time - start_time), " seconds")

# Consumer 1 operation -- Trainer, reads TRAINER_READ_THRESHOLD (3) datapoints at a time
class TrainerReader(threading.Thread):
    def run(self):
        start_time = time.time()
        read_datapoints_in_trainer = 0
        out_index_trainer = 0
        while read_datapoints_in_trainer < CAPACITY:

            time.sleep(TRAINER_WAIT_TIME)

            t_counter = TRAINER_READ_THRESHOLD
            while t_counter > 0:
                full1.acquire()
                t_counter -= 1
            mutex1.acquire()

            trainer_data_to_load = []
            for i in range(TRAINER_READ_THRESHOLD):
                # Wrap around the circular buffer instead of hard-coding 3 indices
                trainer_data_to_load += [data1[(out_index_trainer + i) % CAPACITY]]
            print("Trainer ", str(f'{self.name=}'), "read datapoints: ", str(trainer_data_to_load), " from data1 indices: ", str(out_index_trainer + 1), " to ", str((out_index_trainer + TRAINER_READ_THRESHOLD - 1) % CAPACITY + 1))
            out_index_trainer = (out_index_trainer + TRAINER_READ_THRESHOLD) % CAPACITY
            read_datapoints_in_trainer += TRAINER_READ_THRESHOLD

            mutex1.release()
            for i in range(TRAINER_READ_THRESHOLD):
                empty1.release()
        end_time = time.time()
        print("Trainer " , str(f'{self.name=}'), " completed in ", str(end_time - start_time), " seconds")

# Consumer 2 operation -- Inference, reads a single datapoint
class InfererReader(threading.Thread):
    def run(self):
        start_time = time.time()
        read_datapoints_in_inferer = 0
        out_index_inferer = 0
        while read_datapoints_in_inferer < CAPACITY:

            time.sleep(INFERER_WAIT_TIME)

            i_counter = INFERER_READ_THRESHOLD
            while i_counter > 0:
                full2.acquire()
                i_counter -= 1
            mutex2.acquire()

            inferer_data_to_load = data2[out_index_inferer]
            print("Inferer", str(f'{self.name=}'), "read datapoint: ", str(inferer_data_to_load), " from data2 index: ", str(out_index_inferer + 1))
            out_index_inferer = (out_index_inferer + INFERER_READ_THRESHOLD) % CAPACITY
            read_datapoints_in_inferer += INFERER_READ_THRESHOLD

            mutex2.release()
            for i in range(INFERER_READ_THRESHOLD):
                empty2.release()
        end_time = time.time()
        print("Inferer " , str(f'{self.name=}'), " completed in ", str(end_time - start_time), " seconds")

print('[CONSOLE] Operation started.')

main_start_time = time.time()

# Creating Threads
data_load = DataLoader()
train_read = TrainerReader()
infer_read = InfererReader()

# Starting Threads
data_load.start()
train_read.start()
infer_read.start()

# Waiting for threads to complete
data_load.join()
train_read.join()
infer_read.join()

print("Data 1:\n", data1)
print("Data 2:\n", data2)

main_end_time = time.time()

print('[CONSOLE] Operation completed.')
print('[CONSOLE] Total operation time taken is ', str(main_end_time - main_start_time))

[CONSOLE] Operation started.
Loader  self.name='Thread-6'  added datapoint:  50  to data1 and data2 index:  1
Inferer self.name='Thread-8' read datapoint:  50  from data2 index:  1
Loader  self.name='Thread-6'  added datapoint:  98  to data1 and data2 index:  2
Loader  self.name='Thread-6'  added datapoint:  54  to data1 and data2 index:  3
Inferer self.name='Thread-8' read datapoint:  98  from data2 index:  2
Loader  self.name='Thread-6'  added datapoint:  6  to data1 and data2 index:  4
Trainer  self.name='Thread-7' read datapoints:  [50, 98, 54]  from data1 indices:  1  to  3
Loader  self.name='Thread-6'  added datapoint:  34  to data1 and data2 index:  5
Inferer self.name='Thread-8' read datapoint:  54  from data2 index:  3
Loader  self.name='Thread-6'  added datapoint:  66  to data1 and data2 index:  6
Loader  self.name='Thread-6'  added datapoint:  63  to data1 and data2 index:  7
Inferer self.name='Thread-8' read datapoint:  6  from data2 index:  4
Loader  self.name='Thread-6'  

## Extended Concurrent Python Implementation (Cont.)

As a proof of concept, the code below is the same as above but with `TEST_MEMORIES_WITH_DIFFERENT_DATA = True`, so that the data the Data Loader writes to the Trainer Reader's memory differs from the data it writes to the Inferer Reader's memory.

In [None]:
import time
import random
import threading

random.seed(0)

CAPACITY = 15
DATALOADER_LOAD_THRESHOLD = 1
TRAINER_READ_THRESHOLD = 3
INFERER_READ_THRESHOLD = 1

DATALOADER_WAIT_TIME = 1
TRAINER_WAIT_TIME = 5
INFERER_WAIT_TIME = 2

TEST_MEMORIES_WITH_DIFFERENT_DATA = True

data1 = [-1 for i in range(CAPACITY)]
data2 = [-1 for i in range(CAPACITY)]

mutex1 = threading.Semaphore()
mutex2 = threading.Semaphore()
empty1 = threading.Semaphore(CAPACITY)
empty2 = threading.Semaphore(CAPACITY)
full1 = threading.Semaphore(0)
full2 = threading.Semaphore(0)

# Producer 1 operation -- DataLoader
class DataLoader(threading.Thread):
    def run(self):
        start_time = time.time()
        datapoints_loaded = 0
        in_index = 0
        while datapoints_loaded < CAPACITY:

            time.sleep(DATALOADER_WAIT_TIME)

            empty1.acquire()
            empty2.acquire()
            mutex1.acquire()
            mutex2.acquire()

            datapoint = random.randint(1, 100)
            data1[in_index] = datapoint

            if TEST_MEMORIES_WITH_DIFFERENT_DATA:
                datapoint2 = random.randint(1, 100) # Proof of concept
                data2[in_index] = datapoint2
                print("Loader " , str(f'{self.name=}'), " added datapoint: ", str(datapoint), " to data1 index: ", str(in_index + 1),
                      " and added datapoint: ", str(datapoint2), " to data2 index: ", str(in_index+1))
            else:
                data2[in_index] = datapoint
                print("Loader " , str(f'{self.name=}'), " added datapoint: ", str(datapoint), " to data1 and data2 index: ", str(in_index + 1))

            in_index = (in_index + DATALOADER_LOAD_THRESHOLD) % CAPACITY
            datapoints_loaded += DATALOADER_LOAD_THRESHOLD

            mutex1.release()
            mutex2.release()
            full1.release()
            full2.release()
        end_time = time.time()
        print("Loader " , str(f'{self.name=}'), " completed in ", str(end_time - start_time), " seconds")

# Consumer 1 operation -- Trainer, reads TRAINER_READ_THRESHOLD (3) datapoints at a time
class TrainerReader(threading.Thread):
    def run(self):
        start_time = time.time()
        read_datapoints_in_trainer = 0
        out_index_trainer = 0
        while read_datapoints_in_trainer < CAPACITY:

            time.sleep(TRAINER_WAIT_TIME)

            t_counter = TRAINER_READ_THRESHOLD
            while t_counter > 0:
                full1.acquire()
                t_counter -= 1
            mutex1.acquire()

            trainer_data_to_load = []
            for i in range(TRAINER_READ_THRESHOLD):
                # Wrap around the circular buffer instead of hard-coding 3 indices
                trainer_data_to_load += [data1[(out_index_trainer + i) % CAPACITY]]
            print("Trainer ", str(f'{self.name=}'), "read datapoints: ", str(trainer_data_to_load), " from data1 indices: ", str(out_index_trainer + 1), " to ", str((out_index_trainer + TRAINER_READ_THRESHOLD - 1) % CAPACITY + 1))
            out_index_trainer = (out_index_trainer + TRAINER_READ_THRESHOLD) % CAPACITY
            read_datapoints_in_trainer += TRAINER_READ_THRESHOLD

            mutex1.release()
            for i in range(TRAINER_READ_THRESHOLD):
                empty1.release()
        end_time = time.time()
        print("Trainer " , str(f'{self.name=}'), " completed in ", str(end_time - start_time), " seconds")

# Consumer 2 operation -- Inference, reads a single datapoint
class InfererReader(threading.Thread):
    def run(self):
        start_time = time.time()
        read_datapoints_in_inferer = 0
        out_index_inferer = 0
        while read_datapoints_in_inferer < CAPACITY:

            time.sleep(INFERER_WAIT_TIME)

            i_counter = INFERER_READ_THRESHOLD
            while i_counter > 0:
                full2.acquire()
                i_counter -= 1
            mutex2.acquire()

            inferer_data_to_load = data2[out_index_inferer]
            print("Inferer", str(f'{self.name=}'), "read datapoint: ", str(inferer_data_to_load), " from data2 index: ", str(out_index_inferer + 1))
            out_index_inferer = (out_index_inferer + INFERER_READ_THRESHOLD) % CAPACITY
            read_datapoints_in_inferer += INFERER_READ_THRESHOLD

            mutex2.release()
            for i in range(INFERER_READ_THRESHOLD):
                empty2.release()
        end_time = time.time()
        print("Inferer " , str(f'{self.name=}'), " completed in ", str(end_time - start_time), " seconds")

print('[CONSOLE] Operation started.')

main_start_time = time.time()

# Creating Threads
data_load = DataLoader()
train_read = TrainerReader()
infer_read = InfererReader()

# Starting Threads
data_load.start()
train_read.start()
infer_read.start()

# Waiting for threads to complete
data_load.join()
train_read.join()
infer_read.join()

print("Data 1:\n", data1)
print("Data 2:\n", data2)

main_end_time = time.time()

print('[CONSOLE] Operation completed.')
print('[CONSOLE] Total operation time taken is ', str(main_end_time - main_start_time))

[CONSOLE] Operation started.
Loader  self.name='Thread-9'  added datapoint:  50  to data1 index:  1  and added datapoint:  98  to data2 index:  1
Inferer self.name='Thread-11' read datapoint:  98  from data2 index:  1
Loader  self.name='Thread-9'  added datapoint:  54  to data1 index:  2  and added datapoint:  6  to data2 index:  2
Loader  self.name='Thread-9'  added datapoint:  34  to data1 index:  3  and added datapoint:  66  to data2 index:  3
Inferer self.name='Thread-11' read datapoint:  6  from data2 index:  2
Loader  self.name='Thread-9'  added datapoint:  63  to data1 index:  4  and added datapoint:  52  to data2 index:  4
Trainer  self.name='Thread-10' read datapoints:  [50, 54, 34]  from data1 indices:  1  to  3
Loader  self.name='Thread-9'  added datapoint:  39  to data1 index:  5  and added datapoint:  62  to data2 index:  5
Inferer self.name='Thread-11' read datapoint:  66  from data2 index:  3
Loader  self.name='Thread-9'  added datapoint:  46  to data1 index:  6  and add
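In this run, the trainer's first chunk is `[50, 54, 34]` rather than `[50, 98, 54]` as in the previous run. This is expected: with `TEST_MEMORIES_WITH_DIFFERENT_DATA = True`, the loader draws two numbers per iteration from the same seeded generator, so `data1` receives the 1st, 3rd, 5th (and so on) draws and `data2` the 2nd, 4th, 6th (and so on). The sketch below checks this without any threads; it assumes, as in the code above, that only the loader calls `random`.

```python
import random

CAPACITY = 15

# One seeded stream of 2 * CAPACITY draws, taken back to back.
random.seed(0)
stream = [random.randint(1, 100) for _ in range(2 * CAPACITY)]

# What the loader does when TEST_MEMORIES_WITH_DIFFERENT_DATA is True:
# two draws per iteration, the first to data1 and the second to data2.
random.seed(0)
data1, data2 = [], []
for _ in range(CAPACITY):
    data1.append(random.randint(1, 100))
    data2.append(random.randint(1, 100))

print(data1 == stream[0::2], data2 == stream[1::2])  # True True
```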

## How is this all related to today's task?

Let us first recall our tasks from the top (which have been copied for your convenience).

> 1/ Assuming you have created the class for Data Preprocessing (i.e. Class 1), create **two more separate classes**: **one for ML Training** (i.e. Class 2) that uses _chunks of data between Class 1 and Class 2_, and **another for ML Inference** (i.e. Class 3) that uses a _single datapoint between Class 1 and Class 3_.
>
> 2/ Now, assuming that there is no concurrency between Class 1 and Class 2 as well as between Class 1 and Class 3, find a way to introduce **two new shared memories** (one memory between Class 1 and 2, and one memory between Class 1 and 3) with _limited capacity_ like in the sensor-processor co-simulation example. For this, you can assume that **Class 1 is a producer** and that **Classes 2 and 3 are consumers**.

For 1/: The chunk of three datapoints represents the 'chunks of data' used in ML training, and similarly the single datapoint represents the 'datapoint' used in ML inference. You can scale this up further by assuming that 'chunks of data' means several sets of _20 datapoints_ used to train our LSTM, and that the 'single datapoint' means one set of _20 datapoints_ used for inference with our LSTM. If you design your framework carefully (e.g. by keeping sizes such as the read thresholds as parameters rather than hard-coded values), your implementation should be (somewhat) robust to small changes like these.
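Scaling the example up in this way, a training 'chunk' becomes several windows of 20 datapoints and an inference 'datapoint' becomes one window. The hypothetical helpers below (`make_windows` and `make_training_chunks`, not part of the lab code) sketch one way to do this, assuming non-overlapping windows; passing `step=1` would produce sliding windows instead. The `range(200)` series is only a stand-in for preprocessed data.

```python
def make_windows(series, window=20, step=None):
    """Hypothetical helper: cut a 1-D series into windows of `window`
    datapoints. With the default step (= window) they do not overlap;
    step=1 would give sliding windows instead."""
    step = window if step is None else step
    return [series[i:i + window]
            for i in range(0, len(series) - window + 1, step)]

def make_training_chunks(windows, windows_per_chunk):
    """Hypothetical helper: group windows into equally sized chunks for the
    trainer; leftover windows that do not fill a chunk are not included."""
    n_chunks = len(windows) // windows_per_chunk
    return [windows[k * windows_per_chunk:(k + 1) * windows_per_chunk]
            for k in range(n_chunks)]

series = list(range(200))                   # stand-in for preprocessed data
windows = make_windows(series)              # 10 windows of 20 datapoints
chunks = make_training_chunks(windows, 3)   # 3 chunks of 3 windows (windows 0-8)
single = windows[-1]                        # window 9, left over for inference

print(len(windows), len(chunks), len(chunks[0]), len(single))  # 10 3 3 20
```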

For 2/: Since we have shown how the inherent sequential dependency between the trainer and inferer can be alleviated, the key idea behind having two memories is that it allows the trainer and inferer to perform their operations without affecting one another, i.e. avoiding contention between the two processes over a single *mutex semaphore*. Of course, this means we now need two mutexes, one for the memory between [Data Loader] and [Trainer Reader] and one for the memory between [Data Loader] and [Inferer Reader], and both must be handled appropriately to avoid further conflicts (e.g. by always acquiring them in the same order to avoid deadlock). With this setup, you can essentially slice the data up and feed *individual slices* to **different** processes, these being our ML trainer and ML inference.

<center><img src='loader-trainer-inference-block-diagram-orig-data.png' /></center>
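For the tasks themselves, the semaphore-and-list pattern used above is one option. A swapped-in alternative is Python's standard `queue.Queue`, a thread-safe FIFO: with `maxsize` set, `put` blocks while the queue is full and `get` blocks while it is empty, so one object plays the role of the *empty*, *full* and *mutex* semaphores. The sketch below is not the lab's implementation; the class names, `CAPACITY`, `N_ITEMS`, `TRAIN_CHUNK` and the `None` end-of-stream sentinel are all assumptions. It uses one bounded queue per consumer, mirroring the two shared memories.

```python
import queue
import threading

CAPACITY = 5      # assumed capacity of each shared memory
N_ITEMS = 12      # assumed number of preprocessed datapoints
TRAIN_CHUNK = 3   # datapoints per training read
SENTINEL = None   # marks the end of the data stream

train_memory = queue.Queue(maxsize=CAPACITY)  # Class 1 -> Class 2
infer_memory = queue.Queue(maxsize=CAPACITY)  # Class 1 -> Class 3

class Preprocessor(threading.Thread):         # Class 1: producer
    def run(self):
        for x in range(N_ITEMS):
            train_memory.put(x)   # blocks while this memory is full
            infer_memory.put(x)
        train_memory.put(SENTINEL)
        infer_memory.put(SENTINEL)

class Trainer(threading.Thread):              # Class 2: consumes chunks
    def __init__(self):
        super().__init__()
        self.chunks = []

    def run(self):
        chunk = []
        while (x := train_memory.get()) is not SENTINEL:
            chunk.append(x)
            if len(chunk) == TRAIN_CHUNK:
                self.chunks.append(chunk)   # a real trainer would fit on this
                chunk = []
        if chunk:
            self.chunks.append(chunk)       # leftover partial chunk

class Inferer(threading.Thread):              # Class 3: consumes single items
    def __init__(self):
        super().__init__()
        self.seen = []

    def run(self):
        while (x := infer_memory.get()) is not SENTINEL:
            self.seen.append(x)             # a real inferer would predict on x

threads = [Preprocessor(), Trainer(), Inferer()]
for t in threads:
    t.start()
for t in threads:
    t.join()

trainer, inferer = threads[1], threads[2]
print(trainer.chunks)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
print(inferer.seen)    # [0, 1, 2, ..., 11]
```

Because each consumer has its own queue, a slow inferer only stalls the producer once `infer_memory` is full; the trainer keeps draining `train_memory` in the meantime.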