# Task Queues

A task (or job) queue is an architecture for breaking a process into discrete tasks and streamlining their execution, with flexibility in how the tasks are scheduled and how many of them run concurrently.

## Sounds familiar?

In fact, we have already seen this mechanism at work in the Spark and Dask examples from [Part 14 - Distributed Computation](../14-distributed-computation/notebook.ipynb). There, the queueing was abstracted away from us, so it felt more implicit.

The subject of this part is the more explicit use of task queues. First, let's pick apart the chief components and concepts of a task queue.

## Workers

The workers carry out the unit tasks of an end-to-end process. The process is dissected so that each task can run in isolation, as one incremental step.

In general, granular tasks are easier to implement and test. In the context of a task queue, they are also easier to parallelize, which helps make full use of the available computing resources.
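As a rough illustration of what "granular" can mean, the sketch below splits a small headline-cleaning job into unit steps that can each be tested on their own. The function names (`normalize_title`, `is_truncated`, `summarize`) are hypothetical and chosen only for this example.

```python
def normalize_title(raw):
    """Unit task 1: collapse stray whitespace in a scraped headline."""
    return ' '.join(raw.split())


def is_truncated(title):
    """Unit task 2: flag headlines that were cut short with an ellipsis."""
    return title.endswith('...')


def summarize(titles):
    """Unit task 3: combine the per-title results into one report."""
    cleaned = [normalize_title(t) for t in titles]
    return {
        'total': len(cleaned),
        'truncated': sum(is_truncated(t) for t in cleaned),
    }


report = summarize([
    '  Major websites hit by\n global outage ',
    'Drone footage shows scale of China floods...',
])
print(report)
```

Because the first two steps depend only on their own input, each could be handed to a separate worker without coordination.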

## Queue

The queue holds tasks that have been enqueued, or scheduled, by arbitrary processes that do not need the tasks to run immediately (synchronously). Meanwhile, workers spawned as other processes, threads, or coroutines dequeue the tasks and execute them, either on a schedule or whenever computing resources are available.
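To make the enqueue and dequeue roles concrete, here is a minimal in-process sketch built on the standard library's `queue.Queue` and worker threads. It is a toy under simplifying assumptions (a single process, no broker, no error handling), and the names `task_queue`, `worker`, and `add` are made up for the example.

```python
import queue
import threading

task_queue = queue.Queue()
results = []
results_lock = threading.Lock()


def worker():
    """Dequeue and execute tasks until a None sentinel arrives."""
    while True:
        task = task_queue.get()
        if task is None:
            task_queue.task_done()
            break
        func, args = task
        value = func(*args)
        with results_lock:
            results.append(value)
        task_queue.task_done()


def add(x, y):
    return x + y


# Producer side: enqueue the tasks without waiting for them to run.
for i in range(5):
    task_queue.put((add, (i, i)))

# Consumer side: three workers drain the queue concurrently.
workers = [threading.Thread(target=worker) for _ in range(3)]
for t in workers:
    t.start()

task_queue.join()         # block until every queued task is done
for _ in workers:
    task_queue.put(None)  # one sentinel per worker to shut it down
for t in workers:
    t.join()

print(sorted(results))
```

The producer loop finishes immediately, regardless of how long each task takes. That is the property the timing cells later in this part rely on.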

## Communication protocol and broker

A typical task queue design does not directly spawn processes (and subprocesses) to manage both the queue data structure and the operations on tasks (enqueue/schedule and dequeue/execute). Doing so becomes a struggle when the complexity and scale of the tasks and their results are uncertain.

Instead, most task queues rely on a relatively agnostic communication broker, such as a high-performance data store suited to high-frequency reads and writes, to manage the queue data structure. They also rely on a language-agnostic serialization protocol (such as JSON) to transmit task definitions and results through the queue. Some data stores, categorized as _message queues_, are built almost explicitly for task queues and specialize in inter-process communication.

Such a design also makes extensions, such as queue monitoring and scaling beyond a single machine, much simpler, since most message queues allow transmission across networks. It embraces the principle of separation of concerns.
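The serialization idea can be sketched without any broker at all. In the hypothetical example below, a task call travels as JSON text: the producer only needs to know the task's name, and the worker looks that name up in a registry. The helpers `task`, `serialize_call`, and `execute_message` are assumptions made for illustration, not part of any particular task queue library.

```python
import json

# Registry mapping task names to callables; the worker looks tasks up by name.
TASKS = {}


def task(func):
    """Register a function so it can be referred to by name in a message."""
    TASKS[func.__name__] = func
    return func


@task
def add(x, y):
    return x + y


def serialize_call(name, *args):
    """Producer side: describe the call as JSON text, not as a Python object."""
    return json.dumps({'task': name, 'args': list(args)})


def execute_message(message):
    """Worker side: decode the message, run the task, encode the result."""
    payload = json.loads(message)
    result = TASKS[payload['task']](*payload['args'])
    return json.dumps({'task': payload['task'], 'result': result})


message = serialize_call('add', 1, 2)
print(message)
print(execute_message(message))
```

Since the message is plain text, it could be stored in a data store or sent across a network, and the producer and the worker would not need to share a process.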

In [1]:
%%time

from tqdm import tqdm
import requests


def fetch_website(url):
    """Download a page and return its URL together with the response body."""
    r = requests.get(url)
    return url, r.text

sites = [
    'bbc.com',
    'theguardian.com',
    'dailymail.co.uk',
    'washingtonpost.com',
    'foxnews.com',
    'usatoday.com',
    'wsj.com',
    'nbcnews.com',
]

for domain in tqdm(sites, ncols=100):
    fetch_website(f'https://{domain}')

100%|█████████████████████████████████████████████████████████████████| 8/8 [00:12<00:00,  1.52s/it]

CPU times: user 352 ms, sys: 67.8 ms, total: 419 ms
Wall time: 12.3 s





In [2]:
# module that defines the queued tasks used below (fetch_website_task, add)
import tasks



In [3]:
%%time

for domain in tqdm(sites, ncols=100):
    tasks.fetch_website_task.delay(f'https://{domain}')

100%|█████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 19.09it/s]

CPU times: user 230 ms, sys: 70 ms, total: 300 ms
Wall time: 421 ms





In [20]:
%%time

for domain in tqdm(sites, ncols=100):
    tasks.fetch_website_task.delay(f'https://{domain}').get()

100%|█████████████████████████████████████████████████████████████████| 8/8 [00:11<00:00,  1.50s/it]

CPU times: user 86 ms, sys: 21.7 ms, total: 108 ms
Wall time: 12 s
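The three timings above tell a consistent story. The plain loop in `In [1]` takes about 12 s. The loop in `In [3]` only enqueues the tasks and returns in well under a second. The loop in `In [20]` is back to about 12 s, because calling `.get()` inside the loop waits for each result before the next task is enqueued. The sketch below reproduces the effect with `concurrent.futures` from the standard library, used here only as a stand-in for the task queue. `slow_task` is a hypothetical placeholder for a slow network request.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_task(x):
    time.sleep(0.2)  # stands in for a slow network request
    return x * 2


with ThreadPoolExecutor(max_workers=4) as pool:
    # Pattern 1: wait for each result before submitting the next task.
    start = time.perf_counter()
    blocking = [pool.submit(slow_task, i).result() for i in range(4)]
    blocking_seconds = time.perf_counter() - start

    # Pattern 2: submit everything first, then collect the results.
    start = time.perf_counter()
    handles = [pool.submit(slow_task, i) for i in range(4)]
    collected = [h.result() for h in handles]
    collected_seconds = time.perf_counter() - start

print(f'blocking:  {blocking_seconds:.2f}s')
print(f'collected: {collected_seconds:.2f}s')
```

Both patterns return the same values, but only the second lets the workers overlap their waiting time.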





In [5]:
r = tasks.add.delay(1, 2)

In [6]:
r.get()

3

In [7]:
url = 'https://bbc.com'

In [8]:
r = tasks.fetch_website_task.delay(url)

In [9]:
r.status

'PENDING'

In [10]:
r.get()[0]

'https://bbc.com'

In [11]:
r.get()[1][:500]

'    <!DOCTYPE html>\n<html class="b-header--black--white b-pw-1280 b-reith-sans-font">\n\n    <head>\n        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n        <meta name="description" content="Breaking news, sport, TV, radio and a whole lot more.\n        The BBC informs, educates and entertains - wherever you are, whatever your age.">\n        <meta name="keywords" content="BBC, bbc.co.uk, bbc.com, Search, British Broadcasting Corporation, BBC iPlayer, BBCi">\n        <title'

In [12]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(r.get()[1], 'html.parser')
print(soup.find('body').prettify())

<body class="wwhp-edition-us">
 <div id="cookiePrompt">
 </div>
 <noscript>
  <p style="position: absolute; top: -999em">
   <img alt="" height="1" src="https://a1.api.bbc.co.uk/hit.xiti?&amp;col=1&amp;from=p&amp;ptag=js&amp;s=598253&amp;p=home.page&amp;x2=[responsive]&amp;x3=[bbc_website]&amp;x4=[en]&amp;x7=[Index-home]&amp;x8=[reverb-3.2.0-nojs]&amp;x11=[HOMEPAGE_GNL]&amp;x12=[GNL_HOMEPAGE]" width="1"/>
  </p>
 </noscript>
 <div class="bbccom_display_none" id="bbccom_interstitial_ad">
 </div>
 <div class="bbccom_display_none" id="bbccom_interstitial">
 </div>
 <div class="bbccom_display_none" id="bbccom_wallpaper_ad">
 </div>
 <div class="bbccom_display_none" id="bbccom_wallpaper">
 </div>
 <header aria-label="BBC" id="orb-banner" role="banner">
  <div class="orb-nav-pri orb-nav-pri-white orb-nav-empty" dir="ltr" id="orb-header">
   <div class="orb-nav-pri-container b-r b-g-p">
    <div class="orb-nav-section orb-nav-blocks">
     <a href="https://www.bbc.co.uk">
      Homepage
     

In [13]:
from pprint import pprint

# collect the unique texts of internal links that carry the media__link class
titles = {
    a.text.strip()
    for a in soup.find_all('a')
    if a.get('href', '').startswith('/') and 'media__link' in a.get('class', [])
}
pprint(titles)

{"'Beauty' of island's shipwrecks graveyard captured",
 "'Beauty' of island's shipwrecks graveyard...",
 'AI breakthrough could spark medical revolution',
 "Africa's top shots: Red carpet moments and a leap of faith",
 'An Olympics like no other - Tokyo preview',
 'Australia and NZ pull out of World Cup',
 "Australian Olympic boss in 'mansplaining'...",
 'Car burglar caught with his pants down',
 'Could there be new Amy Winehouse music on the way?',
 "Covid contact testing 'cuts school absences'",
 "Cuba sanctions 'just the beginning', says Biden",
 'Drone footage shows scale of China floods...',
 'Great Mosque of al-Nuri destroyed',
 "Gun owners' fears after firearms dealer data breach",
 'Halima Aden: It’s not just about diverse...',
 'In pictures: Germany grapples with flood aftermath',
 'Indian millionaire embroiled in porn scandal',
 'Lions furious with TMO choice',
 'Major websites hit by global outage',
 "Meet the 12-year-old becoming Tokyo's...",
 'Moment mum saves five-year-ol

In [14]:
import fasttext
from gensim.utils import simple_preprocess

model = fasttext.load_model('./opinion.bin')



In [15]:
# classify each title; predict() returns parallel sequences of labels and scores
res = []
for title in titles:
    label, score = model.predict(' '.join(simple_preprocess(title)))
    res.append([
        url,
        title,
        label[0].replace('__label__', ''),
        score[0],
    ])

In [16]:
pprint(res)

[['https://bbc.com',
  'Drone footage shows scale of China floods...',
  'fact',
  0.9970049858093262],
 ['https://bbc.com',
  'Car burglar caught with his pants down',
  'fact',
  0.8839978575706482],
 ['https://bbc.com',
  "Rare 'lightning rainbows' captured during storms",
  'fact',
  0.9995425939559937],
 ['https://bbc.com',
  'The cost of hosting the Olympics',
  'opinion',
  0.753002405166626],
 ['https://bbc.com',
  'Moment mum saves five-year-old from kidnap...',
  'fact',
  0.8754306435585022],
 ['https://bbc.com',
  'AI breakthrough could spark medical revolution',
  'fact',
  0.9989733695983887],
 ['https://bbc.com',
  'Indian millionaire embroiled in porn scandal',
  'fact',
  1.0000029802322388],
 ['https://bbc.com',
  'Major websites hit by global outage',
  'fact',
  0.9951294660568237],
 ['https://bbc.com',
  'Wolf Alice aiming for Mercury Prize double',
  'fact',
  0.9457973837852478],
 ['https://bbc.com',
  "Nasa probe determines Mars' internal structure",
  'fact',
 

In [17]:
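# persist the classified titles in a SQLite database
# (the connection's `with` block commits on success; it does not close the connection)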
import sqlite3

with sqlite3.connect('news.db') as db:
    db.execute('''
        CREATE TABLE IF NOT EXISTS news(
            url TEXT,
            title TEXT,
            label TEXT,
            score REAL
        );
    ''')
    db.executemany('''
        INSERT INTO news (url, title, label, score)
        VALUES (?, ?, ?, ?);
    ''', res)

In [18]:
import pandas as pd

with sqlite3.connect('news.db') as db:
    df = pd.read_sql('SELECT * FROM news;', con=db)

In [19]:
df

| | url | title | label | score |
|---|---|---|---|---|
| 0 | https://bbc.com | Meet the 12-year-old becoming Tokyo's... | fact | 0.988641 |
| 1 | https://bbc.com | Cuba sanctions 'just the beginning', says Biden | opinion | 0.671030 |
| 2 | https://bbc.com | An Olympics like no other - Tokyo preview | fact | 0.996553 |
| 3 | https://bbc.com | Richarlison inspires Brazil as Spain falter - ... | fact | 0.730434 |
| 4 | https://bbc.com | No breakthrough on NI trade rules after PM's call | fact | 1.000009 |
| ... | ... | ... | ... | ... |
| 69 | https://bbc.com | An Olympics like no other - Tokyo preview | fact | 0.996553 |
| 70 | https://bbc.com | Travelling to an Olympic Games like no other | opinion | 0.835654 |
| 71 | https://bbc.com | Lions furious with TMO choice | fact | 0.640842 |
| 72 | https://bbc.com | The refugee athlete who lost his mum to Covid | opinion | 0.741669 |
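Notice that the same title appears more than once in the frame above (rows 2 and 69), most likely because the insert cell ran against the same table more than once. One possible way to keep re-runs from piling up duplicates is a uniqueness constraint combined with `INSERT OR IGNORE`. The sketch below assumes that `(url, title)` identifies a row, and it uses an in-memory database and a hypothetical table name `news_unique` so that it does not touch `news.db`.

```python
import sqlite3

rows = [
    ('https://bbc.com', 'An Olympics like no other - Tokyo preview', 'fact', 0.996553),
    ('https://bbc.com', 'Lions furious with TMO choice', 'fact', 0.640842),
]

db = sqlite3.connect(':memory:')  # in-memory database, for illustration only
db.execute('''
    CREATE TABLE IF NOT EXISTS news_unique(
        url TEXT,
        title TEXT,
        label TEXT,
        score REAL,
        UNIQUE (url, title)
    );
''')

# Insert the same batch twice, as a re-run cell would.
for _ in range(2):
    db.executemany('''
        INSERT OR IGNORE INTO news_unique (url, title, label, score)
        VALUES (?, ?, ?, ?);
    ''', rows)
db.commit()

count = db.execute('SELECT COUNT(*) FROM news_unique;').fetchone()[0]
print(count)
db.close()
```

With the constraint in place, the second pass is silently skipped and the table keeps one row per title.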
