In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
%load_ext autotime

time: 355 µs (started: 2021-03-09 17:51:17 +01:00)


In [3]:
%cd ..

/Users/rubenbroekx/Documents/Projects/NumberBatchWrapper
time: 1.21 ms (started: 2021-03-09 17:51:17 +01:00)


# Usage

This notebook shows the basic usage of the *Number Batch Wrapper* model, used to encode textual data into multilingual word or sentence embeddings.

## Load Wrapper and data

Load in the *Number Batch Wrapper* model together with the dummy data used in this demonstration. The data is a list of 1000 English sentences.

In [4]:
# Load the demo-data
import json
from pathlib import Path

with open(Path.cwd() / 'demos/data.json', 'r') as f:
    data = json.load(f)
    
print(f"Example sentence: \n'{data[0]}'")

Example sentence: 
'He added that people should not mess with mother nature , and let sharks be .'
time: 2.13 ms (started: 2021-03-09 17:51:17 +01:00)


It is possible to provide your own text cleaner and tokenizer functions:
 - **text cleaning** is applied to each word before it is looked up in the NumberBatch dictionary
 - **tokenizing** splits a given sentence into words, which are then cleaned and looked up in the NumberBatch dictionary

Note that text cleaning is also performed during model initialisation. If a different text cleaning function is provided to the model without re-running the `initialise` method, words cleaned at lookup time may no longer match the entries that were cleaned during initialisation.
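
The mismatch between cleaning at initialisation and cleaning at lookup can be illustrated with a toy example. This is only a sketch: the `cache` dictionary, `clean_a` and `clean_b` are hypothetical stand-ins and do not reflect how the wrapper stores its data internally.

```python
import re


def clean_a(word: str) -> str:
    """Cleaner assumed to be active at initialisation: keep lowercase letters only."""
    return re.sub(r'[^a-z]', '', word.lower())


def clean_b(word: str) -> str:
    """A different cleaner swapped in afterwards: keep lowercase letters and hyphens."""
    return re.sub(r'[^a-z-]', '', word.lower())


# Hypothetical stand-in for the cached dictionary, keyed by words cleaned with clean_a
raw_vocabulary = ['well-known', 'shark']
cache = {clean_a(w): f"<vector for {w}>" for w in raw_vocabulary}

# Same cleaner as at initialisation: the key 'wellknown' exists
hit = cache.get(clean_a('Well-known'))

# Different cleaner: the key 'well-known' was never stored
miss = cache.get(clean_b('Well-known'))

print(hit)   # <vector for well-known>
print(miss)  # None
```

In this sketch the second lookup silently returns nothing, which is the kind of problem that re-running `initialise` after changing the cleaner is meant to avoid.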

In [5]:
%%capture
!pip install fold-to-ascii stop-words

time: 953 ms (started: 2021-03-09 17:51:17 +01:00)


In [6]:
import re
from typing import List
from fold_to_ascii import fold
from stop_words import get_stop_words

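def clean(x: str) -> str:
    """Custom cleaning function."""
    x = fold(x.lower())  # Lowercase and fold non-ASCII characters to ASCII
    x = re.sub(r'[^a-z]', '', x)  # Drop everything that is not a lowercase letter
    return x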

STOP_EN = set(get_stop_words('en'))

def tokenize(sentence: str) -> List[str]:
    """Custom tokenizer."""
    words = re.split(r'\W', sentence)  # Split on non-word characters
    words = [clean(w) for w in words]
    # Drop empty or single-letter tokens and English stop words
    return [w for w in words if len(w) > 1 and w not in STOP_EN]

sentence = "This is an example sentence!"
print(f"Cleaning and tokenizing: '{sentence}'")
print(f" ==> '{tokenize(sentence)}'")

Cleaning and tokenizing: 'This is an example sentence!'
 ==> '['example', 'sentence']'
time: 7.23 ms (started: 2021-03-09 17:51:18 +01:00)


Set up the Number Batch Wrapper model using our custom cleaning and tokenization functions. Other parameters, left at their default values and not shown here, include:
 - `en_fallback`: whether or not to fall back to English if a word is not found. Not applicable for this use-case since the language is already English.
 - `normalise`: whether or not to normalise the resulting sentence embeddings.
 - `level`: segmentation depth of the files, which is explained in more detail in the `2_performance.ipynb` notebook.
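
To give an intuition for what normalising an embedding can mean, the sketch below scales a vector to unit length (L2 normalisation). This is an assumption made for illustration: the notebook does not state which normalisation the `normalise` option applies, and `l2_normalise` is a hypothetical helper, not part of the wrapper.

```python
import math
from typing import List


def l2_normalise(vector: List[float]) -> List[float]:
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        return list(vector)
    return [v / norm for v in vector]


embedding = [3.0, 4.0]
unit = l2_normalise(embedding)

print(unit)  # [0.6, 0.8]
```

For unit-length vectors, the cosine similarity of two embeddings equals their dot product, which is one reason such a normalisation is convenient for similarity search.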

In [7]:
from number_batch_wrapper import Wrapper

wrapper = Wrapper(
    language='en',
    path=Path.home() / 'numberbatch',
    clean_f=clean,
    tokenizer=tokenize,
)

time: 157 ms (started: 2021-03-09 17:51:18 +01:00)


Initialise the model if this has not been done before. For every configuration, `initialise` needs to be run only once on your machine, since the results are cached under the folder specified by `wrapper.path`. The `inp_path` parameter specifies the folder where the Number Batch data is stored, or where it will be downloaded to. Note that this file is rather big (~3GB), so the download might take a while.

In [8]:
if not wrapper.is_initialised():
    wrapper.initialise(
        inp_path=Path.home() / 'Downloads'
    )

Extracting 'en'..: 9161913it [01:12, 126545.97it/s]                             
Segmenting 'en'..: 0it [00:57, ?it/s]


time: 2min 11s (started: 2021-03-09 17:51:18 +01:00)


## Encode and analyse

Use the wrapper to encode the dummy sentences.

In [9]:
results = wrapper(data)
results.shape

Embedding: 100%|██████████| 1000/1000 [00:07<00:00, 140.96it/s]


(1000, 300)

time: 7.37 s (started: 2021-03-09 17:53:30 +01:00)


For demonstration purposes, we search for the two most similar sentences as measured by their cosine similarity. The diagonal of the similarity matrix is set to zero so that a sentence is not matched with itself.

In [10]:
%%capture
!pip install scikit-learn

time: 1.1 s (started: 2021-03-09 17:53:37 +01:00)


In [11]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

similarity = cosine_similarity(results)
np.fill_diagonal(similarity, 0)

time: 556 ms (started: 2021-03-09 17:53:38 +01:00)


In [12]:
# argmax returns an index into the flattened matrix; convert it back to (row, column)
best_sim = similarity.argmax()
idx1, idx2 = np.unravel_index(best_sim, similarity.shape)

print(f"Two most similar sentences (cosine similarity of {similarity[idx1, idx2]:.5f})")
print(f" - {data[idx1]}")
print(f" - {data[idx2]}")

Two most similar sentences (cosine similarity of 0.77776)
 - General and administrative expenses on a consolidated basis increased 24 % to approximately $ 2.7 million ( vs. approximately $ 2.2 million ) due to higher employee costs , an increase in the for doubtful retail accounts primarily from one hotel that was damaged by Hurricane Paloma last year , and higher professional fees .
 - Income tax expense increased $ 32.0 million during the first quarter of 2010 compared to 2009 . The effective tax rate increased to 35.2 percent in the first quarter of 2010 compared to 33.3 percent in 2009 , reflecting the higher effective rate associated with the Folgers business and the net favorable resolution of previously open tax positions in 2009 as compared to 2010 .
time: 1.66 ms (started: 2021-03-09 17:53:39 +01:00)
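
To make the search above more concrete, the following sketch computes cosine similarity by hand and ranks all unique pairs on a handful of toy vectors. The helpers `cosine` and `top_pairs` and the `toy` data are hypothetical and only serve as illustration. The pure-Python loop is far too slow for the full 1000 x 300 matrix, where the vectorised `cosine_similarity` call used above remains the better choice.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_pairs(vectors: Sequence[Sequence[float]], k: int) -> List[Tuple[float, int, int]]:
    """Return the k most similar pairs as (similarity, index1, index2)."""
    # combinations yields each pair (i, j) with i < j exactly once,
    # so self-matches and mirrored duplicates never appear
    scored = [
        (cosine(vectors[i], vectors[j]), i, j)
        for i, j in combinations(range(len(vectors)), 2)
    ]
    return sorted(scored, reverse=True)[:k]


# Toy "embeddings": vectors 0 and 1 point in nearly the same direction, as do 2 and 3
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.3, 0.7]]
best = top_pairs(toy, k=2)

for score, i, j in best:
    print(f"pair ({i}, {j}): {score:.3f}")
```

Iterating over unique pairs plays the same role as zeroing the diagonal in the matrix version: a sentence is never compared with itself.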
