# Task Queues

Task (or job) queues are an architecture for breaking a process into smaller units of work, with flexibility in how those units are scheduled and how many of them run concurrently.

Sound familiar? In [Part 14 - Distributed Computation](../14-distributed-computation/notebook.ipynb), both the Spark and Dask examples relied on this kind of mechanism, though it was abstracted away from us and therefore felt more implicit.

This part covers the more explicit uses of task queues. First, let's pick apart the chief components and concepts of a task queue.

## Workers

Workers carry out the unit tasks of an end-to-end process. The process is divided so that each task can run in isolation, as one incremental step.

In general, granular tasks are easier to implement and test. In the context of a task queue, they are also easier to parallelize, which helps make full use of the available computing resources.

## Queue

The queue holds tasks that have been enqueued, or scheduled for execution, by arbitrary processes that do not need the tasks to run immediately (synchronously). Meanwhile, workers spawned as separate processes, threads, or coroutines dequeue and execute those tasks, either on a schedule or whenever computing resources are available.
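
As a rough illustration of the producer, queue, and worker roles described above, here is a minimal in-process sketch that uses only the standard library (`queue.Queue` and `threading`). The names `square`, `worker`, and `run_tasks` are hypothetical and chosen for this example only; a real task queue would normally place the queue in a separate broker rather than in the producer's own memory.

```python
import queue
import threading


def square(n):
    """A unit task: small, isolated, and easy to test on its own."""
    return n * n


def worker(tasks, results):
    """Dequeue and run tasks until a None sentinel arrives."""
    while True:
        item = tasks.get()
        if item is None:  # sentinel: no more work for this worker
            break
        func, arg = item
        results.put((arg, func(arg)))


def run_tasks(args, n_workers=3):
    tasks, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    # The producer only enqueues; it never calls square() itself.
    for arg in args:
        tasks.put((square, arg))
    # One sentinel per worker so that every worker shuts down.
    for _ in threads:
        tasks.put(None)
    for t in threads:
        t.join()
    # All workers have finished, so the results queue can be drained safely.
    out = {}
    while not results.empty():
        arg, value = results.get()
        out[arg] = value
    return out


squares = run_tasks(range(5))
print(squares)
```

The producer and the workers share nothing except the two queues, which is what lets the number of workers change without touching the code that enqueues tasks.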

## Communication protocol and broker

A typical task queue design does not directly spawn processes (and subprocesses) to manage both the queue data structure and the operation (scheduling and execution) of tasks. Doing so would be hard to sustain given the unpredictable complexity and scale of tasks and results.

Instead, most task queues rely on a relatively agnostic communication broker to manage the queue data structure, and on a programming-language-agnostic serialization format (such as JSON) to transmit task definitions and results through the queue. Some data stores, categorized as _Message Queues_, are built almost explicitly for task queues and specialize in inter-process communication.

Such a design also allows extensions, such as queue monitoring and scaling beyond a single machine, since most message queues provide standalone access and transmission across networks.
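
To make the serialization idea concrete, the sketch below shows one possible way to describe a task as a JSON message and to execute it on the receiving side. The message layout (`task`, `args`, `status`, `result`) and the helper names `serialize_task` and `execute_message` are assumptions made for illustration, not the wire format of any particular task queue library.

```python
import json


def add(x, y):
    return x + y


# Registry mapping task names to callables. Only the name travels through
# the queue, so producer and worker merely need to agree on names.
TASKS = {"add": add}


def serialize_task(name, *args):
    """Producer side: describe the task as a language-agnostic JSON string."""
    return json.dumps({"task": name, "args": list(args)})


def execute_message(message):
    """Worker side: decode the message, run the task, encode the result."""
    payload = json.loads(message)
    func = TASKS[payload["task"]]
    result = func(*payload["args"])
    return json.dumps(
        {"task": payload["task"], "status": "SUCCESS", "result": result}
    )


message = serialize_task("add", 5, 6)
reply = json.loads(execute_message(message))
print(message)
print(reply)
```

Because the message is plain JSON, the worker that consumes it does not have to be written in the same language as the producer, and a broker can store or forward it without understanding its contents.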

![task-queue](https://user-images.githubusercontent.com/2837532/126832454-82a4a8e9-34a0-4ccc-9c39-83248a32be16.png)

In [69]:
%%time

from tqdm import tqdm
import requests


def fetch_website(url):
    # fetch the page and return (url, HTML text); time out rather than hang on a slow site
    r = requests.get(url, timeout=30)
    return url, r.text

sites = [
    'bbc.com',
    'theguardian.com',
    'washingtonpost.com',
    'foxnews.com',
    'wsj.com',
]
data = {}
for domain in tqdm(sites, ncols=100):
    data[domain] = fetch_website(f'https://{domain}')

100%|█████████████████████████████████████████████████████████████████| 5/5 [00:11<00:00,  2.31s/it]

CPU times: user 179 ms, sys: 22.7 ms, total: 202 ms
Wall time: 11.6 s

In [19]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(data['bbc.com'][1], 'html.parser')
print(soup.find('body').prettify())

<body class="wwhp-edition-us">
 <div id="cookiePrompt">
 </div>
 <noscript>
  <p style="position: absolute; top: -999em">
   <img alt="" height="1" src="https://a1.api.bbc.co.uk/hit.xiti?&amp;col=1&amp;from=p&amp;ptag=js&amp;s=598253&amp;p=home.page&amp;x2=[responsive]&amp;x3=[bbc_website]&amp;x4=[en]&amp;x7=[Index-home]&amp;x8=[reverb-3.2.0-nojs]&amp;x11=[HOMEPAGE_GNL]&amp;x12=[GNL_HOMEPAGE]" width="1"/>
  </p>
 </noscript>
 <div class="bbccom_display_none" id="bbccom_interstitial_ad">
 </div>
 <div class="bbccom_display_none" id="bbccom_interstitial">
 </div>
 <div class="bbccom_display_none" id="bbccom_wallpaper_ad">
 </div>
 <div class="bbccom_display_none" id="bbccom_wallpaper">
 </div>
 <header aria-label="BBC" id="orb-banner" role="banner">
  <div class="orb-nav-pri orb-nav-pri-white orb-nav-empty" dir="ltr" id="orb-header">
   <div class="orb-nav-pri-container b-r b-g-p">
    <div class="orb-nav-section orb-nav-blocks">
     <a href="https://www.bbc.co.uk">
      Homepage
     

In [64]:
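# Tag predicates for BeautifulSoup's find_all(): each returns a truthy value
# for tags that hold a headline on the corresponding site.
def bbc(tag):
    return (
        tag.name == 'a'
        and 'media__link' in tag.get('class', [])
        and tag.get('href', '').startswith('/')
        and tag.text.strip()
    )

def guardian(tag):
    return tag.name == 'a' and tag.get('data-link-name') == 'article' and tag.text.strip()

def wp(tag):
    return tag.name == 'span' and tag.parent.name == 'a' and tag.text.strip()

def fox(tag):
    # default to [] so a parent <h2> without a class attribute does not raise TypeError
    return tag.name == 'a' and tag.parent.name == 'h2' and 'title' in tag.parent.get('class', []) and tag.text.strip()

def wsj(tag):
    # match any tag with a class name containing 'headline'
    return any('headline' in cls for cls in tag.get('class', [])) and tag.text.strip()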

In [20]:
from pprint import pprint

titles = set([
    a.text.strip()
    for a in soup.find_all('a')
    if a.get('href', '').startswith('/') and 'media__link' in a.get('class', [])  # bbc.com rule; .get avoids KeyError on links without a class
])
pprint(titles)

{"'Beauty' of island's shipwrecks graveyard...",
 "'It's so nice to get back into the festival vibe'",
 'AI breakthrough could spark medical revolution',
 "Africa's top shots: Eid prayers and sea dips",
 'An Olympics like no other - Tokyo preview',
 "Australian Olympic boss in 'mansplaining'...",
 'Car burglar caught with his pants down',
 'China’s president makes first Tibet visit as leader',
 'Covid: Moderna jab approved for teenagers in EU',
 "Critics split on Sixth Sense director's new horror",
 'Drone footage shows scale of China floods...',
 'Great Mosque of al-Nuri destroyed',
 'Hundred: Exciting climax to Birmingham v London match',
 'In pictures: Germany grapples with flood aftermath',
 'Inside a US fire truck driving through a wildfire',
 'Key obtained to unlock files from cyber attack',
 'Man Utd announce £73m Sancho signing',
 "Meet the 12-year-old becoming Tokyo's...",
 'Moment mum saves five-year-old from kidnap...',
 'Mother dies after saving baby from China floods',
 'O

In [22]:
# repeat the process for theguardian.com
soup = BeautifulSoup(data['theguardian.com'][1], 'html.parser')
titles = list(set([
    a.text.strip()
    for a in soup.find_all('a', {'data-link-name': True})
    if a['data-link-name'] == 'article'
]))
pprint(titles)

['I was sick, tired and had lost myself – until I began lifting weights at 71',
 'Get in touch  Share a story with the Guardian',
 'Ben Ryan Limits on swoosh mob hinder change for the better',
 'From the agencies  Gaza’s silent children',
 'Coronavirus live: Indonesia reports record daily deaths; Philippines bans '
 'children from going out',
 'Live  Coronavirus: Indonesia reports record daily deaths; Philippines bans '
 'children from going out',
 '‘They’re a little crazy’  The ultramarathon runners crossing Death Valley – '
 'in a drought',
 'A Covid commentary from found images',
 'At least 112 dead in Maharashtra state',
 'Athens appoints chief heat officer to combat climate crisis',
 'Justice ministry brands local Bellingcat reporting partner as ‘foreign '
 'agent’',
 'Friday’s best photos  Pop-up beaches and doggy pools',
 "Football  Former Haitian FA refs' chief banned for life by Fifa over sexual "
 'abuse',
 'Mountain of Salt  A Covid commentary from found images',
 '‘Anything

In [23]:
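# NOTE: 'dailymail.co.uk' is not in the `sites` list above; add it there and re-fetch before running this cell
soup = BeautifulSoup(data['dailymail.co.uk'][1], 'html.parser')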
titles = list(set([
    a.text.strip()
    for a in soup.find_all('a', {'itemprop': True})
    if a['itemprop'] == 'url'
]))
pprint(titles)

["That's another fine mess you've gotten me into! Mother elephant ends up "
 'stuck in the mud alongside her two children after trying to drag them out... '
 'before the whole family is rescued',
 'Police launch desperate hunt for missing schoolboy, 15, who vanished more '
 'than 24 hours ago',
 'PM Boris Johnson rallies Team GB athletes over video call... but officials '
 "are increasingly concerned that British athletes 'pinged' as close contacts "
 'to Covid positives will have to isolate for 14 DAYS',
 "Sydney's Covid outbreak is declared a 'national emergency' as Australia's "
 'largest city sees record cases despite month-long lockdown',
 'BBC receives 133 complaints as Scottish and Welsh fans complain that its '
 'coverage of the Euro 2020 final was too pro-English',
 'What every young person who fears the jab MUST be told: Vaccine expert ANGUS '
 'DALGLEISH dismantles beliefs that have seen rates stall among the 18-30s',
 'Firefighters face disciplinary action after blow-up sex

In [67]:
soup = BeautifulSoup(data['wsj.com'][1], 'html.parser')
titles = list(set([
    tag.text.strip()
    for tag in soup.find_all(wsj)
]))
pprint(titles)

['Page Not Found',
 'What Parents With Unvaccinated Kids Need to Know About the Delta Variant '
 'This Summer',
 'Video Shows Demolition of Miami-Area Condo Building',
 'How the EV Industry Is Trying to Fix Its Charging Bottleneck',
 '404',
 'Some Vaccinated People Are Dying of Covid-19. Here’s Why Scientists Aren’t '
 'Surprised.',
 "We can’t find the page you're looking for. If you typed the URL into your "
 'browser, check that you entered it correctly. If you reached this page via '
 'our site or search, please let us know by emailing support@wsj.com',
 'Watch Chinese Astronauts’ First Spacewalk Outside New Space Station',
 'JPMorgan, Goldman Call Time on Work-From-Home. Their Rivals Are Ready to '
 'Pounce.']


In [78]:
import extract_tasks as tasks

In [71]:
r = tasks.add.delay(5, 6)

In [72]:
r.status

'SUCCESS'

In [73]:
r.get()

11

In [89]:
%%time

sites = [
    'bbc.com',
    'theguardian.com',
    'washingtonpost.com',
    'foxnews.com',
    'wsj.com',
]
data = {}
for domain in tqdm(sites, ncols=100):
    data[domain] = tasks.fetch_website_task.delay(f'https://{domain}')
    print(data[domain].status)

100%|████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 509.61it/s]

PENDING
PENDING
PENDING
PENDING
PENDING
CPU times: user 10.7 ms, sys: 3.85 ms, total: 14.6 ms
Wall time: 13.5 ms
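
The loop above returns in milliseconds, with every status still `PENDING`, because enqueueing only hands the task to the queue and gives back a handle to a result that does not exist yet. The `extract_tasks` module is not shown here, so the following is only a standard-library analogue of that behavior, using `concurrent.futures` in place of a broker-backed queue. `slow_task` is a hypothetical stand-in for a network fetch.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_task(n):
    """Stand-in for a slow, I/O-bound unit task such as fetching a page."""
    time.sleep(0.2)
    return n * 2


with ThreadPoolExecutor(max_workers=5) as pool:
    start = time.perf_counter()
    # submit() returns a Future immediately, without waiting for the task.
    futures = [pool.submit(slow_task, n) for n in range(5)]
    submit_seconds = time.perf_counter() - start
    # Right after submission the tasks are typically still unfinished.
    pending = [not f.done() for f in futures]
    # result() blocks until the corresponding task has completed.
    results = [f.result() for f in futures]
    total_seconds = time.perf_counter() - start

print(f"submitted in {submit_seconds:.4f}s, finished in {total_seconds:.4f}s")
print(pending)
print(results)
```

The time spent submitting is a small fraction of the time spent waiting for results, which mirrors the gap between the sequential fetch earlier in this part and the enqueue-only loop above.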

## Pipelining

Task queue implementations sometimes come with a pipelining (or chaining) capability, which allows unit tasks to be arranged as a pipeline in a logical order. The scheduling mechanism dispatches unit tasks to available workers as the tasks become ready, while maintaining the order of data flow between unit tasks in a given pipeline.

![pipeline-pool](https://user-images.githubusercontent.com/2837532/128251647-c7cbd989-4f04-4104-9c9b-059083b041e2.png)
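
A minimal sketch of the pipelining idea follows, assuming one worker thread per stage with the stages connected by standard-library queues. The names `stage`, `run_pipeline`, `strip_title`, and `word_count` are hypothetical. A real task queue would usually draw workers from a shared pool rather than pin one worker to each stage, so treat this as an illustration of ordered data flow only.

```python
import queue
import threading

STOP = object()  # sentinel passed down the pipeline to shut each stage down


def stage(func, inbox, outbox):
    """Apply func to every item from inbox and forward the output to outbox."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # let the next stage know the stream has ended
            break
        outbox.put(func(item))


def run_pipeline(items, funcs):
    # One queue in front of each stage, plus one for the final output.
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(STOP)
    results = []
    while True:
        out = queues[-1].get()
        if out is STOP:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


def strip_title(text):
    return text.strip()


def word_count(title):
    return title, len(title.split())


counts = run_pipeline(
    ["  Task queues  ", " A longer headline here "],
    [strip_title, word_count],
)
print(counts)
```

With a single worker per stage and first-in, first-out queues, items leave the pipeline in the order they entered, while different stages can still work on different items at the same time.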

In [1]:
# TODO: pipeline and server code

In [3]:
# revise pipeline to include opinion scores, demonstrate shutting down pipeline safely while the tasks taken in the interim are safely stored (a good architecture)

In [4]:
# maybe revise pipeline to format the payload as slack block?