1. Import the standard-library modules `random` and `time`, as well as the `Queue` and `Thread` classes from their respective modules, as shown in the code below.

In [1]:
from queue import Queue
from threading import Thread
import random
import time

We imported the modules that we will use to design our next mock system.

2. Initialise the mock dataset and put it into a queue, as shown below.

In [2]:
urls = ['url1-', 'url1-', 'url2-', 'url3-', 'url4-', 'url5-', 'url6-', 'url7-', 'url8-', 'url9-', 'url10-']
seen = set()

url_queue = Queue()
for url in urls:
    url_queue.put(url)



We created 11 mock URLs, one of which is a deliberate duplicate (`url1-` appears twice), and a `seen` set that the deduplicator will use to detect duplicates. We then created a queue for our URLs and added each URL to the queue.
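As a quick check of how the queue behaves, the sketch below uses a hypothetical three-item sample (`sample_urls` and `sample_queue` are illustrative names, not part of the tutorial's code). It suggests that `Queue` hands items back in the order they were added and that `qsize()` reports how many are still waiting.

```python
from queue import Queue

# Hypothetical three-item sample, not the tutorial's full list
sample_urls = ['url1-', 'url1-', 'url2-']

sample_queue = Queue()
for url in sample_urls:
    sample_queue.put(url)

queued_count = sample_queue.qsize()  # number of items waiting in the queue
first_out = sample_queue.get()       # FIFO: the first item put is the first returned
remaining = sample_queue.qsize()     # one fewer after the get()

print(queued_count, first_out, remaining)
```

Note that the queue itself does nothing about the duplicate `url1-`; both copies are stored, which is why a separate deduplicator is needed later.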

3. Set up queues for each of the other components, up to and including the deduplicator, as shown in the code below.

In [3]:
scraped_queue = Queue()
cleaned_queue = Queue()
deduplicated_queue = Queue()

We initialized one `Queue` object per component. Each component pushes its output to its own queue when it has finished processing an item.

4. Define the scraper module as a function, as shown below.

In [4]:
def scraper():
    while True:
        time.sleep(random.randrange(0,2))
        url = url_queue.get()
        print("Scraping {}".format(url))
        scraped_queue.put(url[3:])

Our scraper function is designed to be run in a thread, so it uses a `while True` loop. We use `time.sleep()` to simulate work taking a variable amount of time. We then get an available URL from the first queue and remove its first three characters, leaving just the number and the trailing hyphen, which we put on `scraped_queue`.
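To make the two building blocks of the scraper concrete, here is a small sketch that runs them outside a thread. The variable names (`stripped`, `delays`) are only for illustration. It shows what the slice returns for a two-digit URL, and that `random.randrange(0, 2)` excludes its upper bound, so the simulated delay is either 0 or 1 whole seconds.

```python
import random

url = 'url10-'
stripped = url[3:]  # drops the leading 'url', keeps the number and hyphen
print(stripped)

# randrange excludes the stop value, so only 0 and 1 can appear here
delays = {random.randrange(0, 2) for _ in range(1000)}
print(delays)
```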

5. Define similar functions for the cleaner and deduplicator components, as shown below.

In [5]:
def cleaner():
    while True:
        time.sleep(random.randrange(2,4))
        raw = scraped_queue.get()
        print("Cleaning {}".format(raw))
        cleaned_queue.put(raw.replace("-", ""))
        
def deduplicator():
    while True:
        time.sleep(random.randrange(4,6))
        cleaned = cleaned_queue.get()
        print("Deduplicating {}".format(cleaned))
        if cleaned not in seen:
            deduplicated_queue.put(cleaned)
            seen.add(cleaned)


We defined functions for our cleaner and deduplicator. Both work much like the `scraper()` function, but the cleaner removes the trailing hyphen, and the deduplicator checks whether the cleaned version (only the number) has been seen before. Only items that have not been seen are put on `deduplicated_queue`.
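It can help to trace the three transformations without threads, queues, or sleeps. The sketch below is a simplified, synchronous restatement, not the tutorial's code: the helper names (`scrape_step`, `clean_step`, `dedupe_step`, `seen_ids`) are hypothetical, and returning `None` for a duplicate is an assumption made here to keep the trace visible.

```python
def scrape_step(url):
    # Same slice as scraper(): drop the leading 'url'
    return url[3:]

def clean_step(raw):
    # Same replacement as cleaner(): remove the hyphen
    return raw.replace("-", "")

seen_ids = set()

def dedupe_step(cleaned):
    # Same membership check as deduplicator(); None marks a dropped duplicate
    if cleaned in seen_ids:
        return None
    seen_ids.add(cleaned)
    return cleaned

results = [dedupe_step(clean_step(scrape_step(u)))
           for u in ['url1-', 'url1-', 'url2-']]
print(results)  # ['1', None, '2']
```

The second `url1-` is reduced to `'1'` like the first, so the deduplicator recognizes it and drops it.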

6. Initialise threads for each component, as shown in the code below.

In [6]:
scraper_worker = Thread(target=scraper)
cleaner_worker = Thread(target=cleaner)
deduplicator_worker = Thread(target=deduplicator)

We created three threads, one for each component, using the `Thread` class and passing the respective function in through the `target` parameter.

7. Add the threads to a list and start each of them, as shown below.

In [7]:
threads = [
    scraper_worker, cleaner_worker, deduplicator_worker
]

[t.start() for t in threads]

[None, None, None]

Scraping url1-
Scraping url1-
Cleaning 1-
Scraping url2-
Scraping url3-
Scraping url4-
Scraping url5-
Scraping url6-
Scraping url7-
Scraping url8-


We put our three worker threads in a list and called `start()` on each of them. Note that although the scraper works significantly faster than the other two components, the system remains available because the scraper puts all its outputs on a queue, where they wait to be picked up by the cleaner. If the cleaner were to break completely, the work would continue queuing up until the cleaner became available again.
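The queuing-up behavior can be observed directly. The following is a minimal sketch under simplifying assumptions: the loops are finite so that the example terminates, the sleeps are omitted, and the names `fast_scraper`, `late_cleaner`, `scraped`, and `cleaned` are hypothetical stand-ins for the components above. The producer finishes before any consumer exists, and the backlog is then drained once the consumer starts.

```python
from queue import Queue
from threading import Thread

scraped = Queue()
cleaned = Queue()

def fast_scraper(urls):
    # Finite stand-in for scraper(): push every result without waiting
    for url in urls:
        scraped.put(url[3:])

def late_cleaner(expected):
    # Finite stand-in for cleaner(): process a known number of items
    for _ in range(expected):
        cleaned.put(scraped.get().replace("-", ""))

urls = ['url1-', 'url2-', 'url3-']

# The scraper runs to completion while no cleaner is available
producer = Thread(target=fast_scraper, args=(urls,))
producer.start()
producer.join()
backlog = scraped.qsize()  # all results are still waiting in the queue

# The cleaner comes online later and works through the backlog
consumer = Thread(target=late_cleaner, args=(len(urls),))
consumer.start()
consumer.join()
drained = [cleaned.get() for _ in range(len(urls))]

print(backlog, drained)
```

Because the queue holds the scraped items until something consumes them, nothing is lost while the cleaner is unavailable.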