In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
%load_ext autotime

time: 314 µs (started: 2021-03-09 20:52:01 +01:00)


In [3]:
%cd ..

/Users/rubenbroekx/Documents/Projects/NumberBatchWrapper
time: 1.35 ms (started: 2021-03-09 20:52:01 +01:00)


# Performance

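This notebook evaluates the performance of the model for different `level` configurations. As a rule of thumb, a higher `level` gives faster inference at the cost of creating more files, and therefore a longer initialisation. In the run recorded below, inference on the demo data drops from roughly 28 minutes at `level=1` to about 5 seconds at `level=10`, while initialisation grows from under 2 minutes to just over 3 minutes.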
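To build intuition for this trade-off, here is a toy sketch. It assumes, purely for illustration, that `level` is the number of leading characters used to split the vocabulary into separate files. The notebook does not confirm this, and `shard_by_prefix` is a hypothetical helper, not part of `number_batch_wrapper`.

```python
from collections import defaultdict
from typing import Dict, List


def shard_by_prefix(words: List[str], level: int) -> Dict[str, List[str]]:
    """Group words by their first `level` characters (hypothetical scheme)."""
    shards = defaultdict(list)
    for word in words:
        shards[word[:level]].append(word)
    return dict(shards)


# Tiny made-up vocabulary, only to show how shard count and size change.
vocab = ["cat", "car", "cab", "dog", "dot", "cow"]
for level in (1, 2, 3):
    shards = shard_by_prefix(vocab, level)
    largest = max(len(v) for v in shards.values())
    print(f"level={level}: {len(shards)} shards, largest holds {largest} words")
```

Under this assumption, a deeper prefix means more shards to write during initialisation, but each lookup only has to scan a smaller shard.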

In [4]:
from number_batch_wrapper import Wrapper

time: 150 ms (started: 2021-03-09 20:52:01 +01:00)


In [5]:
from time import time

TIME = {}
def start(tag):
    assert tag not in TIME
    TIME[tag] = time()
    
def stop(tag):
    assert tag in TIME
    TIME[tag] = time() - TIME[tag]
    
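def str_time(seconds) -> str:
    """Format a duration in seconds, e.g. 65.25 -> '1m 5s 0.25000'."""
    # Truncate each unit with int(); round() would overstate the whole
    # seconds whenever the fractional part is >= 0.5.
    result = ""
    if seconds // 3600:
        result += f"{int(seconds // 3600)}h "
        seconds %= 3600
    if seconds // 60:
        result += f"{int(seconds // 60)}m "
        seconds %= 60
    if seconds // 1:
        result += f"{int(seconds)}s "
        seconds %= 1
    result += f"{seconds:.5f}"
    return result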
        
def print_time_overview():
    print("Time overview:")
    for k,v in sorted(TIME.items()):
        print(f" - {k}: {str_time(v)}")

time: 1.75 ms (started: 2021-03-09 20:52:01 +01:00)


In [6]:
# Load the demo data used to evaluate performance
import json
from pathlib import Path

with open(Path.cwd() / 'demos/data.json', 'r') as f:
    data = json.load(f)

time: 2.38 ms (started: 2021-03-09 20:52:01 +01:00)


In [7]:
# Custom cleaning and tokenization, see notebook 1_usage.ipynb
import re
from typing import List
from fold_to_ascii import fold
from stop_words import get_stop_words

def clean(x:str) -> str:
    """Custom cleaning function."""
    x = fold(x.lower())
    x = re.sub(r'[^a-z]', '', x)
    return x

STOP_EN = set(get_stop_words('en'))

def tokenize(sentence:str) -> List[str]:
    """Custom tokenizer."""
    sentence = re.split(r'\W', sentence)
    sentence = [clean(w) for w in sentence]
    sentence = [w for w in sentence if len(w) > 1 and w not in STOP_EN]
    return sentence

time: 7.14 ms (started: 2021-03-09 20:52:01 +01:00)


## Time

Time both the creation (`initialise`) and the inference (`__call__`) of the model for every `level` from 1 to 10.
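The `start`/`stop` helpers defined earlier work fine for this. As an alternative, the same measurement can be sketched with a context manager, which keeps the start and stop paired even if the timed code raises. The names `timed` and `TIMINGS` are hypothetical and not part of this notebook; `time.perf_counter` is used because it is a monotonic clock intended for measuring intervals.

```python
from contextlib import contextmanager
from time import perf_counter, sleep

TIMINGS = {}  # hypothetical registry, separate from the notebook's TIME dict


@contextmanager
def timed(tag: str):
    """Record the wall-clock duration of the enclosed block under `tag`."""
    begin = perf_counter()
    try:
        yield
    finally:
        # Stored even if the block raises, so partial runs stay visible.
        TIMINGS[tag] = perf_counter() - begin


# Stand-in workload; in the notebook this would be wrapper.initialise(...)
# or wrapper(data).
with timed("demo"):
    sleep(0.01)

print(f"demo took {TIMINGS['demo']:.5f}s")
```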

In [8]:
for level in range(1,11):
    wrapper = Wrapper(
        language='en',
        path=Path.home() / f'numberbatch/{level}',
        clean_f=clean,
        tokenizer=tokenize,
        level=level
    )
    
    start(f"init_{level}")
    wrapper.initialise(
        inp_path=Path.home() / 'Downloads'
    )
    stop(f"init_{level}")
    
    start(f"call_{level}")
    results = wrapper(data)
    stop(f"call_{level}")
    
print_time_overview()

Extracting 'en'..: 9161913it [01:09, 132336.33it/s]                             
Segmenting 'en'..: 0it [00:39, ?it/s]
Embedding: 100%|██████████| 1000/1000 [28:19<00:00,  1.70s/it] 
Extracting 'en'..: 9161913it [01:06, 137141.55it/s]                             
Segmenting 'en'..: 0it [00:39, ?it/s]
Embedding: 100%|██████████| 1000/1000 [04:36<00:00,  3.62it/s]
Extracting 'en'..: 9161913it [01:05, 139138.79it/s]                             
Segmenting 'en'..: 0it [00:41, ?it/s]
Embedding: 100%|██████████| 1000/1000 [00:45<00:00, 21.90it/s]
Extracting 'en'..: 9161913it [01:10, 130530.09it/s]                             
Segmenting 'en'..: 0it [00:52, ?it/s]
Embedding: 100%|██████████| 1000/1000 [00:14<00:00, 71.22it/s]
Extracting 'en'..: 9161913it [01:17, 118541.20it/s]                             
Segmenting 'en'..: 0it [01:04, ?it/s]
Embedding: 100%|██████████| 1000/1000 [00:08<00:00, 120.04it/s]
Extracting 'en'..: 9161913it [01:18, 116060.87it/s]                             
Segment

Time overview:
 - call_1: 28m 20s 0.50459
 - call_10: 5s 0.50755
 - call_2: 4m 36s 0.39052
 - call_3: 46s 0.70036
 - call_4: 14s 0.15349
 - call_5: 9s 0.59546
 - call_6: 6s 0.24971
 - call_7: 5s 0.79914
 - call_8: 5s 0.81513
 - call_9: 4s 0.40859
 - init_1: 1m 50s 0.92526
 - init_10: 3m 16s 0.59860
 - init_2: 1m 47s 0.30791
 - init_3: 1m 49s 0.13731
 - init_4: 2m 4s 0.74781
 - init_5: 2m 23s 0.04795
 - init_6: 2m 38s 0.51560
 - init_7: 2m 48s 0.18359
 - init_8: 3m 14s 0.70625
 - init_9: 3m 13s 0.32836
time: 59min 30s (started: 2021-03-09 20:52:01 +01:00)
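The overview lists `call_10` before `call_2` because the tags are sorted as plain strings. A minimal sketch of a sort key that orders them by numeric level instead, assuming every tag has the form `<name>_<level>` (`tag_key` is a hypothetical helper):

```python
from typing import Tuple


def tag_key(tag: str) -> Tuple[str, int]:
    """Split a tag such as 'call_10' into ('call', 10) for sorting."""
    name, _, level = tag.rpartition("_")
    return name, int(level)


tags = ["call_1", "call_10", "call_2", "init_1", "init_10", "init_2"]
print(sorted(tags, key=tag_key))
# ['call_1', 'call_2', 'call_10', 'init_1', 'init_2', 'init_10']
```

Inside `print_time_overview`, this would amount to `sorted(TIME.items(), key=lambda kv: tag_key(kv[0]))`.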





## Visualise

Visualise the measured results.
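A minimal sketch of one way to plot such timings, assuming `matplotlib` is installed and that tags follow the `init_<level>` / `call_<level>` pattern used above. The helpers `series` and `plot_timings` are hypothetical. A logarithmic y-axis is used because the inference times span from seconds to tens of minutes.

```python
from typing import Dict, List, Tuple


def series(timings: Dict[str, float], kind: str) -> Tuple[List[int], List[float]]:
    """Return (levels, seconds) for tags '<kind>_<level>', ordered by level."""
    pairs = sorted(
        (int(tag.rpartition("_")[2]), value)
        for tag, value in timings.items()
        if tag.rpartition("_")[0] == kind
    )
    levels = [level for level, _ in pairs]
    seconds = [value for _, value in pairs]
    return levels, seconds


def plot_timings(timings: Dict[str, float]) -> None:
    """Plot init and call durations per level on a logarithmic y-axis."""
    import matplotlib.pyplot as plt  # imported lazily; assumed to be installed

    fig, ax = plt.subplots()
    for kind in ("init", "call"):
        levels, seconds = series(timings, kind)
        ax.plot(levels, seconds, marker="o", label=kind)
    ax.set_yscale("log")
    ax.set_xlabel("level")
    ax.set_ylabel("time (s)")
    ax.legend()
    plt.show()


# Placeholder values for illustration only, not measurements from this notebook.
example = {"init_1": 2.0, "call_1": 8.0, "init_2": 3.0, "call_2": 1.0}
print(series(example, "call"))
```

In this notebook, `plot_timings(TIME)` would draw both curves from the recorded measurements.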