Term Co-Occurrence Matrix
========================
A key part of caterpillar is the creation of datastructures to make text analytics a lot more accessible. One of the key datastructures required to do this is the term co-occurrence matrix.

Before we get onto the specifics of the co-occurrence matrix data structure, it's worth giving some background context on caterpillar. At its core, caterpillar is an text search (or information retrieval) library just like Apache Lucene. It indexes documents in a configurable way, making them searchable. Where caterpillar starts to differ from Lucene (and the core reason for its existence) is around analytics. Caterpillar:

* Breaks documents up into context blocks (called frames or text excerpts). By default, there are blocks of 2 sentences that don't span paragraph breaks, but they are configurable).
* A term co-occurrence matrix is created using term co-occurrence at the **context block** level.
* Frequencies of terms are required at the context block level. **Note:** that a term can only appear once in a context block.

This notebook experiments with the best way to create and store the term co-occurrence matrix given a list of terms for each context block.

Key Considerations
------------------
There is a few key characteristics of how this data structured is used that should be kept in mind when implementing a solution:

1. Access of individual elements from the matrix should be possible by both indexes (`matrix[0][10]`) and label (`matrix['term1']['term2']`).
2. The following characteristics of the data structure are used by Kapiche to derive topics: 
  * degree of co-occurrence for a term (the number of dimensions for a term vector / matrix row that are > 0)
  * the co-occurrence between two terms (values of the matrix).
3. The structure needs to be serialised to disk and accessed incrementally. It shouldn't be necessary to load the entire matrix into memory just to access 1 value. Or, put another way, the on-disk representation should be indexed. Ideally, it should be indexed in such a way that an entire row or a single item can be fetched of disk directly. This is so we can support incremental processing of a matrix.
4. We should never exceed a configurable amount of memory when building the matrix. This will probably be around 1GB.
5. Because of the nature of term co-occurrences, this matrix will probably be sparse.
6. NumPy, SciPy and Cython are all acceptable solutions. Code should be Python 2 & 3 compliant.

Outside of this, the number 1 consideration is to use the smallest amount of memory and consume the least amount of time as practically possible 😆

In [1]:
import numpy as np
from scipy.sparse import lil_matrix, dok_matrix, csr_matrix
from collections import defaultdict, OrderedDict
from random import randint, choice
from pprint import pprint as pp
from array import array
from itertools import product, combinations
from time import perf_counter
from copy import copy
from time import sleep

# Setup (Metrics)

In [2]:
import psutil
import os

def mem():
    """ Use psutil to record memory snapshot. """
    pid = os.getpid()
    p = psutil.Process(pid)
    rss, vms = p.memory_info()
    return vms

class Stats:
    """ Context manager for reporting memory change and time cost. """
    def __enter__(self):
        self.m1 = mem()
        self.t1 = perf_counter()
        
    def __exit__(self, type, value, traceback):
        self.t2 = perf_counter()
        self.m2 = mem()
        print('\nChange in memory: ', end='')
        print('{:.4g} MB'.format((self.m2 - self.m1) / 1024 / 1024))
        print('Time cost (s)   : ', end='')
        print('{:.4g} s\n'.format(self.t2 - self.t1))
    
# Demo!
with Stats():
    x = [0]*100000000  # 100M
    
with Stats():
    del x


Change in memory: 762.9 MB
Time cost (s)   : 0.5079 s


Change in memory: -762.9 MB
Time cost (s)   : 0.2511 s



# Create the list of context blocks

In [3]:
words = [_.strip() for _ in open('/usr/share/dict/words', 'r')]
print('Number of unique words: %d\n' % len(words))
words = words[:61000]  # Truncate the list to be more realistic

Number of unique words: 235886



### Create our own hash for bidirectional lookups

In [4]:
# Give word, get index
# This is the opposite of `words`: give index, get word
wordsd = OrderedDict(zip(words, range(len(words))))

In [5]:
# Test
x = words[1234]
print(x)
print(wordsd[x])
words[32751]

acetonization
1234


'Ceratitis'

## Dataset creation

In [6]:
def make_context_blocks(num_blocks=100000, word_count=(5, 20)):
    context_blocks = []
    for i in range(num_blocks):
        block_size = choice(range(*word_count))
        #
        # Pretty important that `sorted` is called here. This makes 
        # combinations stable later.
        #
        block = sorted(set(choice(words) for i in range(block_size)))
        context_blocks.append(block)
    return context_blocks
             
with Stats():
    context_blocks = make_context_blocks()
    
print('Sample blocks:')
for b in context_blocks[:5]:
    print(' - ', '/'.join(b))


Change in memory: 18.41 MB
Time cost (s)   : 2.133 s

Sample blocks:
 -  Cromwell/alumni/aphonous/appraiser/archiepiscopacy/assorter/chalcographist/cisterna/clavolae/clubwood/colostomy/contradictory/discutient/dispondaic/dottle/downstairs/effusiveness/ekaboron/elatedness
 -  acidness/acinarious/adjudgment/authorizer/awearied/bonesetter/cancerophobia/carvyl/centrosymmetry/ditolyl/duplicative
 -  Ampelidae/Antaiva/abrenounce/afternote/autographer/caramba/colloquialness
 -  aspiring/bonedog/catalase/chalastic/decastyle/dynamiting/ecumenicalism
 -  Batrachophidia/analcitite/anopluriform/arctation/auntship/burnover/caprimulgine/celation/circumumbilical/deadhouse/disaccommodation


### Build a version of context_blocks that is only arrays of arrays

This changes the context blocks, i.e. the list of lists of 5-40 strings, into a two-dimensional array of integers. Each integer is an index into the `idx` hash that was built earlier.

The array **is preallocated** for both rows and columns.  Currently we're using a default of 100 for columns.  This easily covers the 5-40 band, obviously.  We use "-1" as default, and this is used to know which entries are valid words and which are not.

In [40]:
def make_cb_array(context_blocks, max_words_per_block=100):
    # Note: values are initialized to -1.  This is to keep track of 
    # which entries are valid. These will be >=0, and will index into
    # the `words` list.
    context_blocks_array = np.zeros(
        (len(context_blocks), max_words_per_block), 
        dtype='i4') - 1
    context_block_lengths = np.zeros(len(context_blocks), dtype='u2')
    for i, block in enumerate(context_blocks):
        for j, word in enumerate(block):
            # wordsd is a reverse lookup. You give the word, it tells
            # you the index in the "words" array.
            context_blocks_array[i, j] = wordsd[word]
        context_block_lengths[i] = len(block)
    return context_blocks_array, context_block_lengths

# Store the length of each row in a separate array
context_blocks_array, context_block_lengths = make_cb_array(context_blocks)

In [41]:
# Demo
print(context_blocks_array[500])
for i in range(context_block_lengths[500]):
    print(words[context_blocks_array[500, i]], end=', ')

[45744  7523 18040 26204 52017 59468    -1    -1    -1    -1    -1    -1
    -1    -1    -1    -1    -1    -1    -1    -1    -1    -1    -1    -1
    -1    -1    -1    -1    -1    -1    -1    -1    -1    -1    -1    -1
    -1    -1    -1    -1    -1    -1    -1    -1    -1    -1    -1    -1
    -1    -1    -1    -1    -1    -1    -1    -1    -1    -1    -1    -1
    -1    -1    -1    -1    -1    -1    -1    -1    -1    -1    -1    -1
    -1    -1    -1    -1    -1    -1    -1    -1    -1    -1    -1    -1
    -1    -1    -1    -1    -1    -1    -1    -1    -1    -1    -1    -1
    -1    -1    -1    -1]
Crocodylidae, anacoluthic, ballot, bricky, despotat, ectodermoidal, 

# Naive `dict` method.  Dicts inside Dicts

(Also, this is all working with strings.  See further down for using naive dicts but with integers everywhere.)

In [9]:
def method_dict(context_blocks):
    """
    Given a list of blocks (each containing 5-40 words), build a dict that 
    itself contain dicts. the inner dict has a count of the number of associations
    between the outer key and the inner key.
    """
    d = defaultdict(lambda: defaultdict(int))
    for block in context_blocks:
        for w1, w2 in combinations(block, 2):
            d[w1][w2] += 1
    return d

with Stats():
    d = method_dict(context_blocks)
    
associations = sum(len(w2s) for w1, w2s in d.items())
print('associations: ', associations)


Change in memory: 480 MB
Time cost (s)   : 4.2 s

associations:  7535974


Using `setdefault` all over the place is slower, but really not by much.

In [10]:
def method_dict2(context_blocks):
    """
    Given a list of blocks (each containing 5-15 words), build a dict that 
    itself contain dicts. the inner dict has a count of the number of associations
    between the outer key and the inner key.
    """
    d = {}
    for block in context_blocks:
        for w1, w2 in combinations(block, 2):
            d.setdefault(w1, {})
            d[w1].setdefault(w2, 0)
            d[w1][w2] += 1
    return d

with Stats():
    d2 = method_dict(context_blocks)
    
associations = sum(len(w2s) for w1, w2s in d2.items())
print('associations: ', associations)


Change in memory: 404.8 MB
Time cost (s)   : 4.551 s

associations:  7535974


In [11]:
# Show a sample of the resulting dict.
for i, (w1, w2s) in enumerate(d.items()):
    if i > 5:
        break
    print(w1)
    for j, w2 in enumerate(w2s):
        if j > 5:
            break
        print(' '*8, '{:20} {:10}'.format(w2, d[w1][w2]))

arisard
         bebaste                       1
         burro                         1
         centauric                     1
         boxkeeper                     1
         begray                        1
         corruptor                     1
compaction
         cow                           1
         ditchdigger                   1
         corporator                    1
         electrotonicity               1
         deteriority                   1
         condensable                   1
applicatively
         dicaryophase                  1
         bibliophilism                 1
         disdainfulness                1
         backaching                    1
         archminister                  1
         disreputably                  1
bookkeeping
         complacency                   1
         countersubject                1
         dispensative                  1
         coproduce                     1
         disinsulation                 1
         cal

# Using a counter

In [12]:
from collections import Counter

def method_counter(context_blocks):
    c = Counter()
    for block in context_blocks:
        c.update(combinations(block, 2))
    return c

with Stats():
    cnt = method_counter(context_blocks)
print('Associations:',len(cnt))
print()


Change in memory: 1017 MB
Time cost (s)   : 3.657 s

Associations: 7535974



In [13]:
for iter, ((w1, w2), c) in enumerate(cnt.items()):
    print('{:15}{:15}{:4}'.format(w1, w2, c))
    if iter>5:
        break

alloquy        condivision       1
athyroidism    collusion         1
Cerdonian      bullwhip          1
Celsius        ebullioscopy      1
cinerea        deposition        1
chondroprotein cudgerie          1
coccochromatic cytostome         1


# Cython (naive) - Also using dicts

In [14]:
%load_ext cython

In [15]:
%%cython -a

import numpy as np

def method_cython1(list context_blocks):
    """
    Given a list of blocks (each containing 5-40 words), build a dict that 
    itself contain dicts. the inner dict has a count of the number of associations
    between the outer key and the inner key.
    """
    #cdef int n = int(100e6)
    #cdef unsigned int[:] w1 = np.zeros(n, dtype='u4') - 1
    #cdef unsigned int[:] w2 = np.zeros(n, dtype='u4') - 1
    cdef int end = 0, i, j, blen
    cdef list block
    cdef dict out = {}, inner
    cdef str w1, w2
    for block in context_blocks:
        blen = len(block)
        for i in range(blen):
            w1 = block[i]
            inner = out.get(w1) or {}
            for j in range(i+1, blen):
                w2 = block[j]
                if not w2 in inner:
                    inner[w2] = 0
                inner[w2] += 1
            out[w1] = inner
    return out

In [16]:
with Stats():
    d = method_cython1(context_blocks)
    
associations = sum(len(w2s) for w1, w2s in d.items())
print('associations: ', associations)


Change in memory: 184 MB
Time cost (s)   : 2.413 s

associations:  7535974


# Numpy

A quick demo of how to use the integer version of the context blocks.

In [17]:
# Take on particular block
a = context_blocks_array[500]
# Take only the assigned words from the block (drop "-1"s)
b = a[a>-1]
print('Words in this block:\n\n',b, end='\n'*2)
x = np.zeros(200, dtype='i4')
x[5:5+len(b)] = b
print('Pair combinations of these words:', end='\n'*2)
for _ in list(combinations(b, 2)):
    print(_, end=",")

Words in this block:

 [45744  7523 18040 26204 52017 59468]

Pair combinations of these words:

(45744, 7523),(45744, 18040),(45744, 26204),(45744, 52017),(45744, 59468),(7523, 18040),(7523, 26204),(7523, 52017),(7523, 59468),(18040, 26204),(18040, 52017),(18040, 59468),(26204, 52017),(26204, 59468),(52017, 59468),

### Tools for the numpy work: faster combinations, and `lru_cache`

In [18]:
from scipy.misc import comb
from itertools import chain
from functools import lru_cache

# The basic strategy is to build INDICES of 
# combinations, and then use Numpy's clever
# index assignment to generate the actual 
# combinations arrays.

@lru_cache()
def comb_index(n, k):
    count = comb(n, k, exact=True)
    index = np.fromiter(chain.from_iterable(combinations(range(n), k)), 
                        'i4', count=count*k)
    return index.reshape(-1, k)

def combb(data):
    idx = comb_index(len(data), 2)
    return data[idx]

# It turns out that 2-combinations are efficiently produced via an upper
# triangluar array. Other than that, same as before, we first calculate
# the INDICES array, and then pass that into our data to build the
# actual list of combinations.

@lru_cache()
def comb_index_triu(n, k):
    return np.array(np.triu_indices(n, 1)).T
    
def combtriu(data):
    idx = comb_index_triu(len(data), 2)
    return data[idx]

print('Compare the first few elements of each combinations function:', end='\n\n')
print(combb(b[:3]))
print(combtriu(b[:3]))

Compare the first few elements of each combinations function:

[[45744  7523]
 [45744 18040]
 [ 7523 18040]]
[[45744  7523]
 [45744 18040]
 [ 7523 18040]]


## Basic numpy method.  All arrays, uses fast combinations functions.

In [19]:
import numpy as np
from scipy.misc import comb

def method_numpy1(context_blocks_array, max_words_per_block=40):
    """
    We create one, very long array (many rows) with 2 columns.  Every time
    we add a co-occurence, we simply use a new row to record the two words.
    There are some clever tricks inside the sub methods, mostly about how 
    to work with the combinations efficiently, but basically this pretty 
    much just records every co-occurence in a pretty dumb way.
    
    It turns out this is also quite fast.
    
    Note that we DON'T sum the counts here.  This means that the output 
    array will have duplicated pairs. IOW there will be multiple rows
    with the same two entries.  Afterwards, you will have to sum the
    duplicates to determine the co-occurence counts.
    """
    # Pre-allocation of array: WORST CASE
    p = comb(max_words_per_block, 2)
    n = int(len(context_blocks_array) * p)
    print('Worst-case pre-allocation is {:,} entries.'.format(n))
    co = np.zeros((n, 2), dtype='i4') - 1
    end = 0  # Keep track of position in the allocation array

    for block in context_blocks_array:
        # Combinations of words in this block. (m, 2) array
        new_entries = combtriu(block[block>-1])  
        # Copy the new associations directly in
        co[end:end+len(new_entries), :] = new_entries
        # Move the "current position" marker
        end += len(new_entries)
    
    # Return an array of the correct size (truncate)
    print('Actual count turned out to be {:,} entries.'.format(end+1))
    return co[:end, :]

### Performance Test

In [20]:
try:
    del co
except:
    pass

with Stats():
    co = method_numpy1(context_blocks_array)
    
associations = len(co)
print('associations: {:,}'.format(associations))

Worst-case pre-allocation is 78,000,000 entries.
Actual count turned out to be 7,551,389 entries.

Change in memory: 595.1 MB
Time cost (s)   : 1.248 s

associations: 7,551,388


### How to use the output?  Use slicing.

In [21]:
# Demo of use
def top_cooccurences(co, word, most_common_count=10):
    """ 
        co: one big array (n x 2).  Each entry is an individual co-occurence.
        word: A word that you want to find the co-occurences for.
        most_common_count: The number of most common co-occurences to return.
        
    You give a word, this function returns the 
    other words most strongly associated with
    it, along with the counts.
    """
    ix = wordsd[word]
    # Find most common pair with "capivi"
    entries_above = co[co[:,0]==ix]
    entries_below = co[co[:,1]==ix]

    single_array = np.concatenate((entries_above[:,1], entries_below[:,0]), axis=0)  
    idx, counts = np.unique(single_array, return_counts=True)
    
    other_words = [words[idx[_]] for _ in range(most_common_count)]
    return other_words, counts[:most_common_count]

In [22]:
top_cooccurences(co, 'capivi', 3)

(['abashlessly', 'abigailship', 'abodement'], array([1, 1, 1]))

To find the most common associations in the entire result, you would have to build a sparse array to count them.

**Note that the act of building the sparse array will also count duplicate entries automatically. It's doing some of our work for us basically.**

In [23]:
def make_sparse(co):
    return csr_matrix(
            (np.ones(co.shape[0], dtype='u4'), (co[:,0], co[:,1])),
            dtype='u4')

m = make_sparse(co)
m.shape

(60994, 61000)

Now we can query the top counts across the entire array quite easily.

In [24]:
# All entries with a cooccurence > 2
# The two arrays returned are indexes for each dimension.
m[m>2].nonzero()

(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11]))

If you're **only interested in rows**, you could also sum the array across the columns and see what comes up.

In [25]:
sums = m.sum(axis=1)
max_count_index = sums.argmax()
max_count_value = sums[max_count_index, 0]
print('Word with the biggest count is {} with {}.'.format(max_count_index, max_count_value))
print('(That word is {})'.format(words[max_count_index]))

Word with the biggest count is 9211 with 460.
(That word is Ansel)


What if you want to find the top 10 **ROWS**?

In [26]:
def top_rows(m: "sparse array", count=10):
    sums = m.sum(axis=1).ravel()
    #print(sums.shape, sums)
    indices = np.argsort(sums, 1)
    #print(indices.shape, indices)
    indices = indices[0, -count:]
    #print(indices.shape, indices)
    #print(indices[0, -2])
    for ix in range(-1, -count-1, -1):
        i = indices[0, ix]
        print('{:20} : {:<}'.format(words[i], sums[0,i]))
        
# Demo
print('The top 2:')
print('==========')
top_rows(m, 2)
print()
print('The top 10:')
print('===========')
top_rows(m, 10)

The top 2:
Ansel                : 460
Amaranthaceae        : 450

The top 10:
Ansel                : 460
Amaranthaceae        : 450
Azorian              : 448
Akali                : 431
Cellepora            : 425
Asteroidea           : 422
Bimini               : 416
Armenian             : 415
Apriline             : 415
Apotactici           : 413


# Very large test

In [79]:
with Stats():
    new_cb = make_context_blocks(num_blocks=int(2e5), word_count=(5,40))
    
with Stats():
    new_cba, new_cba_lengths = make_cb_array(new_cb)

print(len(new_cb), len(new_cba))


Change in memory: -397.6 MB
Time cost (s)   : 9.219 s


Change in memory: 152.3 MB
Time cost (s)   : 1.897 s

200000 200000


In [80]:
try:
    del co2
except:
    pass

with Stats():
    co2 = method_numpy1(new_cba)
    
print('Associations  : ','{0:,}'.format(len(co2)))
print('Size of result: {:,.2f} MB'.format(co2.nbytes/1024/1024))
    

Worst-case pre-allocation is 156,000,000 entries.
Actual count turned out to be 56,443,921 entries.

Change in memory: 1190 MB
Time cost (s)   : 3.104 s

Associations  :  56,443,920
Size of result: 430.63 MB


### Check out the top 5 rows

In [81]:
with Stats():
    m = make_sparse(co2)
    
print('Size of sparse matrix: {:,.2f} MB'.format(m.data.nbytes/1024/1024))
print()
print('Top 5 rows (words):')
print()
    
with Stats():
    top_rows(m, 5)


Change in memory: 430.6 MB
Time cost (s)   : 4.302 s

Size of sparse matrix: 212.08 MB

Top 5 rows (words):

Arabophil            : 2740
Addy                 : 2717
Anamnionata          : 2627
Agatha               : 2622
Alopecias            : 2578

Change in memory: 0.6875 MB
Time cost (s)   : 0.2726 s



# What about `dict` but with ints and our numpy tools?

The results are pretty bad, surprisingly so.  Needs more investigation to figure out why.

In [82]:
def method_dict_int(context_blocks_array):
    """
    Given a list of blocks (each containing 5-40 words), build a dict that 
    itself contain dicts. the inner dict has a count of the number of associations
    between the outer key and the inner key.
    
    THIS ONE USES INTEGERS EVERYWHERE.
    """
    d = defaultdict(lambda: defaultdict(int))
    for block in context_blocks_array:
        for w1, w2 in combtriu(block[block>-1]):
            d[w1][w2] += 1
    return d

with Stats():
    d = method_dict_int(context_blocks_array)
    
associations = sum(sum(w2s.values()) for w1, w2s in d.items())
print('associations: ', associations)
print(len(context_blocks_array))


Change in memory: -104 MB
Time cost (s)   : 17.95 s

associations:  7551388
100000


# Sparse

In [83]:
import numpy as np
from scipy.misc import comb

def method_sparse(context_blocks_array, max_words_per_block=40, max_section_length=int(1e7)):
    """
    Series of sparse matrix constructions.
    
    max_section_length is a setting.  Tweak to trade-off CPU vs RAM.
    """    
    # Max combinations possible in each block
    p = comb(max_words_per_block, 2)     
    
    # Buffers 
    ones = np.ones(max_section_length, dtype='u2')
    co = np.zeros((max_section_length, 2), dtype='u2')
    end = 0  # Keep track of position in the allocation array 
    
    # The max number of unique words.  Might need to go up.
    # Sets num rows and cols for the output sparse matrix
    ns = 2**16  # (65536) 
    # Output. Stores co-occurrence totals between word pairs.
    # The datatype determines the max count possible, and also the 
    # memory cost of the sparse matrix.  'u2' is quite aggressively
    # small. u4 shouldn't be much worse.
    m = csr_matrix((ns, ns), dtype='u2')  # 
    
    for block in context_blocks_array:
        # Combinations of words in this block.
        new_entries = combtriu(block[block>-1])  
        #new_entries = combb(block[block>-1]) 
        # Copy the new associations directly in
        co[end:end+len(new_entries), :] = new_entries
        # Move the "current position" marker
        end += len(new_entries)
        # Buffer might be full
        full = end > max_section_length - p  # Account for next iteration fill-up, worst case
        if full:
            m += csr_matrix((ones[:end], (co[:end, 0], co[:end, 1])), (ns, ns))
            end = 0 # Reset back to start
    
    if end > 0:
        m += csr_matrix((ones[:end], (co[:end, 0], co[:end, 1])), (ns, ns))
    return m    
    
try:
    del m
except:
    pass

print('Length of context_blocks_array:',len(context_blocks_array))
with Stats():
    m = method_sparse(context_blocks_array[:100000], max_section_length=int(1e7))
    
print('Total co-occurences: {:,}'.format(m.sum()))
print('Size of sparse matrix: {:,.2f} MB'.format(m.data.nbytes/1024/1024))

Length of context_blocks_array: 100000

Change in memory: 228.6 MB
Time cost (s)   : 1.382 s

Total co-occurences: 7,551,388
Size of sparse matrix: 14.37 MB


### Using the sparse array

In [84]:
with Stats():
    top_rows(m)

Ansel                : 460
Amaranthaceae        : 450
Azorian              : 448
Akali                : 431
Cellepora            : 425
Asteroidea           : 422
Bimini               : 416
Armenian             : 415
Apriline             : 415
Apotactici           : 413

Change in memory: 26.58 MB
Time cost (s)   : 0.04783 s



## Try out the big one

In [85]:
try:
    del m
except:
    pass

print('Length of context_blocks_array:',len(new_cba))
with Stats():
    m = method_sparse(new_cba, max_section_length=int(1e6))
print('Total co-occurences: {:,}'.format(m.sum()))
print('Size of sparse matrix: {:,.2f} MB'.format(m.data.nbytes/1024/1024))

Length of context_blocks_array: 200000

Change in memory: 292.1 MB
Time cost (s)   : 12.9 s

Total co-occurences: 56,443,920
Size of sparse matrix: 106.04 MB


### Memory is great but it's a bit on the slow side.

We can increase the buffer size, reducing the number of times a sparse matrix has to be built internally.  Let's search for the optimum.

In [86]:
try:
    del m
    #sleep(0)
except:
    pass

print('Length of context_blocks_array:',len(new_cba))
for i in range(1,11):
    size = int(i*1e7)
    print('*********************')
    print('Buffer size: {:,}'.format(size))
    print('*********************')
    with Stats():
        m = method_sparse(new_cba, max_section_length=size)
    print('Total co-occurences: {:,}'.format(m.sum()))
    print('Size of sparse matrix: {:,.2f} MB'.format(m.data.nbytes/1024/1024))

Length of context_blocks_array: 200000
*********************
Buffer size: 10,000,000
*********************

Change in memory: 629.3 MB
Time cost (s)   : 6.736 s

Total co-occurences: 56,443,920
Size of sparse matrix: 106.04 MB
*********************
Buffer size: 20,000,000
*********************

Change in memory: 243.1 MB
Time cost (s)   : 6.538 s

Total co-occurences: 56,443,920
Size of sparse matrix: 106.04 MB
*********************
Buffer size: 30,000,000
*********************

Change in memory: 419.9 MB
Time cost (s)   : 6.632 s

Total co-occurences: 56,443,920
Size of sparse matrix: 106.04 MB
*********************
Buffer size: 40,000,000
*********************

Change in memory: -0.2773 MB
Time cost (s)   : 6.897 s

Total co-occurences: 56,443,920
Size of sparse matrix: 106.04 MB
*********************
Buffer size: 50,000,000
*********************

Change in memory: -372.4 MB
Time cost (s)   : 7.177 s

Total co-occurences: 56,443,920
Size of sparse matrix: 106.04 MB
******************

Looks like we get our best timings with a buffer length of 3e7.

# Sparse + Cython

The sparse option seems to work quite well.  Here we'll try to optimize it using Cython.  The main thing is to remove all interaction with the Python runtime inside the inner loops.

### First make some utilities

In [87]:
%%cython -a
# cython: language_level = 3
cimport cython
cimport numpy as np
import numpy as np

@cython.cdivision(True)
cdef inline long comb_count_safe(long n, long k):
    """ Returns the number of combinations for n items in
    groups of k elements. This algorithm won't blow up, unlike
    the factorial math version. """
    cdef long i, prod = 1
    for i in range(k):
        # The bracketing is super-important. Order of operations matters.
        prod = (prod * (n - i)) / (i + 1)
    return prod
    
def comb_cy(int n):
    cdef int i, j
    for i in range(n):
        for j in range(i+1,n):
            print(i,j)
            
cdef inline object comb_cy2(unsigned short n):
    cdef int i, j, row = 0
    cdef unsigned short[:,:] out = np.zeros((comb_count_safe(n,2), 2), dtype='u2')
    for i in range(n):
        for j in range(i+1,n):
            #print(i,j)
            out[row,0] = i
            out[row,1] = j
            row += 1
    return np.array(out)
          
    
def make_lookup_array(unsigned short max_n):
    """ This produces a lookup array for easily generating
    combinations. Instead of calculating the combinations each
    time (for example with `comb_cy()`), instead we PRECALCULATE
    all of the indices for pair combinations of set of max_n
    elements.  Mostly in this notebook at the time of writing, 
    max_n is 40, i.e. 40 words in a context block."""
    cdef int i, j, k, n, row = 0
    cdef int maxcol = comb_count_safe(max_n, 2)
    cdef unsigned short[:,:,:] out1 = np.zeros((max_n+1, maxcol, 2), dtype='u2')
    cdef unsigned short[:] out2 = np.zeros(max_n+1, dtype='u2')
    cdef unsigned short[:,:] tmp
    for i in range(1, max_n + 1):
        n = comb_count_safe(i, 2)
        tmp = comb_cy2(i)
        for j in range(n):
            for k in range(2):
                out1[i, j, k] = tmp[j, k]
        out2[i] = n
    return np.array(out1), np.array(out2)
    
comb_cy(5)
print('safe',comb_count_safe(40,2))

from scipy.misc import comb
print('scipy', comb(40, 2))

print(comb_cy2(5))

In [88]:
x, xi = make_lookup_array(5)
print(x.shape, xi.shape)

for i, r in enumerate(x):
    if i <2:
        continue
    print()
    print('row, ',i)
    print('=========')
    print()
    print(xi[i], r.shape, r)

(6, 10, 2) (6,)

row,  2

1 (10, 2) [[0 1]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]]

row,  3

3 (10, 2) [[0 1]
 [0 2]
 [1 2]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]]

row,  4

6 (10, 2) [[0 1]
 [0 2]
 [0 3]
 [1 2]
 [1 3]
 [2 3]
 [0 0]
 [0 0]
 [0 0]
 [0 0]]

row,  5

10 (10, 2) [[0 1]
 [0 2]
 [0 3]
 [0 4]
 [1 2]
 [1 3]
 [1 4]
 [2 3]
 [2 4]
 [3 4]]


In [89]:
list(combinations(range(5), 2))

[(0, 1),
 (0, 2),
 (0, 3),
 (0, 4),
 (1, 2),
 (1, 3),
 (1, 4),
 (2, 3),
 (2, 4),
 (3, 4)]

In [121]:
%%cython -a
# cython: boundscheck = False
# cython: wraparound = False
cimport cython
cimport numpy as np
import numpy as np
from scipy.misc import comb
from itertools import chain
from functools import lru_cache
from scipy.sparse import csr_matrix

ctypedef int wtype
cdef char* wtype_s = 'i4'

@cython.cdivision(True)
cdef inline long comb_count_safe(long n, long k):
    cdef long i, prod = 1
    for i in range(k):
        # The bracketing is super-important. Order of operations matters.
        prod = (prod * (n - i)) / (i + 1)
        #print(n-i, i+1, prod)
    return prod
            
cdef inline object comb_cy2(wtype n):
    cdef int i, j, row = 0
    cdef wtype[:,:] out = np.zeros((comb_count_safe(n,2), 2), dtype=wtype_s)
    for i in range(n):
        for j in range(i+1,n):
            #print(i,j)
            out[row,0] = i
            out[row,1] = j
            row += 1
    return np.array(out)
          
cpdef make_lookup_array(wtype max_n):
    """ This produces a lookup array for easily generating
    combinations. Instead of calculating the combinations each
    time (for example with `comb_cy()`), instead we PRECALCULATE
    all of the indices for pair combinations of set of max_n
    elements.  Mostly in this notebook at the time of writing, 
    max_n is 40, i.e. 40 words in a context block."""
    cdef int i, j, k, n, row = 0
    cdef int maxcol = comb_count_safe(max_n, 2)
    cdef wtype[:,:,:] out1 = np.zeros((max_n+1, maxcol, 2), dtype=wtype_s)
    cdef wtype[:] out2 = np.zeros(max_n+1, dtype=wtype_s)
    cdef wtype[:,:] tmp
    for i in range(1, max_n + 1):
        n = comb_count_safe(i, 2)
        tmp = comb_cy2(i)
        for j in range(n):
            for k in range(2):
                out1[i, j, k] = tmp[j, k]
        out2[i] = n
    return np.array(out1), np.array(out2)

def method_sparse_cy2(
            int[:, :] context_blocks_array, 
            unsigned short[:] context_block_lengths,
            int max_words_per_block=40, 
            int max_section_length=int(1e7)):
    """
    Series of sparse matrix constructions.
    
    max_section_length is a setting.  Tweak to trade-off CPU vs RAM.
    """   
    x, xi = make_lookup_array(max_words_per_block)
    cdef wtype[:,:,:] lookup = x
    cdef wtype[:] lookup_i = xi
    # Max combinations possible in each block
    cdef int p = comb(max_words_per_block, 2)     
    
    # Buffers 
    cdef np.ndarray ones = np.ones(max_section_length, dtype=wtype_s)
    cdef wtype[:, :] co = np.zeros((max_section_length, 2), dtype=wtype_s)
    cdef wtype[:, :] new_entries = np.zeros((p, 2), dtype=wtype_s)
    cdef wtype[:, :] indices = np.zeros((p, 2), dtype=wtype_s)
    cdef long end = 0  # Keep track of position in the allocation array 
    
    # The max number of unique words.  Might need to go up.
    # Sets num rows and cols for the output sparse matrix
    cdef long ns = 2**16  # (65536) 
    # Output. Stores co-occurrence totals between word pairs.
    # The datatype determines the max count possible, and also the 
    # memory cost of the sparse matrix.  'u2' is quite aggressively
    # small. u4 shouldn't be much worse.
    m = csr_matrix((ns, ns), dtype=wtype_s)  # 

    cdef int i, j, k, cb, nn, num_words, n = context_blocks_array.shape[0]
    cdef int[:] block
    cdef wtype[:,:] pbuffer = np.zeros((p, 2), dtype=wtype_s)
    cdef bint full
    
    # Step over each context_block. Each contains a number of 
    # words between 5 and 40.
    for cb in range(context_blocks_array.shape[0]):
        # Find the number of words in this block
        num_words = context_block_lengths[cb]
        # Based on the NUMBER OF WORDS, use our lookup list to
        # get the pre-calculated array of indices for generating
        # the combinations. To visualize, we get back something like
        # this:
        # [0 1], [0 2], [0 3], [0 4], [1 2]
        # This is used to generate.
        ##indices = lookup[num_words]
        # Now that we have the indices, make the array of actual
        # co-occurrences.
        #with cython.boundscheck(False):
        for i in range(num_words):
            for j in range(i+1, num_words):
                co[end, 0] = context_blocks_array[cb, i]
                co[end, 1] = context_blocks_array[cb, j]
                end += 1
        # Buffer might be full
        full = end > max_section_length - p  # Account for next iteration fill-up, worst case
        if full:
            m += csr_matrix((ones[:end], (np.array(co[:end, 0]), np.array(co[:end, 1]))), (ns, ns))
            end = 0 # Reset back to start
    
    if end > 0:
        #m += csr_matrix((ones[:end], (co[:end, 0], co[:end, 1])), (ns, ns))
        m += csr_matrix((ones[:end], (np.array(co[:end, 0]), np.array(co[:end, 1]))), (ns, ns))
    return m    

In [115]:
try:
    del m
except:
    pass

print('Length of context_blocks_array:',len(context_blocks_array))
print('cba',context_blocks_array.shape, context_blocks_array.dtype)
print('cba_lengths', context_block_lengths.shape, context_block_lengths.dtype)
with Stats():
    #m = method_sparse_cy1(context_blocks_array, max_section_length=int(1e7))
    m = method_sparse_cy2(context_blocks_array, context_block_lengths)
    
print('Total co-occurences: {:,}'.format(m.sum()))
print('Shape of sparse matrix:', m.shape)
print('Size of sparse matrix: {:,.2f} MB'.format(m.data.nbytes/1024/1024))

Length of context_blocks_array: 100000
cba (100000, 100) int32
cba_lengths (100000,) uint16

Change in memory: 0 MB
Time cost (s)   : 0.5155 s

Total co-occurences: 7,551,388
Shape of sparse matrix: (65536, 65536)
Size of sparse matrix: 28.75 MB


In [116]:
m.nonzero()

(array([    0,     0,     0, ..., 60985, 60986, 60993], dtype=int32),
 array([  128,   242,   342, ..., 60993, 60992, 60998], dtype=int32))

In [117]:
"""
Ansel                : 460
Amaranthaceae        : 450
Azorian              : 448
Akali                : 431
Cellepora            : 425
Asteroidea           : 422
Bimini               : 416
Armenian             : 415
Apriline             : 415
Apotactici           : 413
"""

with Stats():
    top_rows(m, 5)

Ansel                : 460
Amaranthaceae        : 450
Azorian              : 448
Akali                : 431
Cellepora            : 425

Change in memory: 1.895 MB
Time cost (s)   : 0.03324 s



In [119]:
try:
    del m
except:
    pass

print('Length of context_blocks_array:',len(new_cba))
with Stats():
    m = method_sparse_cy2(new_cba, new_cba_lengths, max_section_length=int(2e7))
    
print('Total co-occurences: {:,}'.format(m.sum()))
print('Size of sparse matrix: {:,.2f} MB'.format(m.data.nbytes/1024/1024))

Length of context_blocks_array: 200000

Change in memory: 426.8 MB
Time cost (s)   : 4.564 s

Total co-occurences: 56,443,920
Size of sparse matrix: 212.08 MB


# Sparse+Cython: using a lookup list for combinations

In the previous sparse+cython method, the combinations are calculated with nested loops.  We could pre-calculate all the indices for the exact word count in a particular context block, and just iterate through that.  It is unclear whether this will be helpful. The number of items iterated over will be the same.  It's just a question of whether the loop is nested or flat. (The flat one would be the pre-computed terms of the combinations).

In [143]:
%%cython -a
# cython: boundscheck = False
# cython: wraparound = False
cimport cython
cimport numpy as np
import numpy as np
from scipy.misc import comb
from itertools import chain
from functools import lru_cache
from scipy.sparse import csr_matrix
import time

ctypedef unsigned short wtype
cdef char* wtype_s = 'u2'

cdef class benchmark:
    """ Context manager for timings """
    cdef str name
    cdef double start, end
    def __init__(self, str name):
        self.name = name
    def __enter__(self):
        self.start = time.time()
    def __exit__(self,ty,val,tb):
        end = time.time()
        print("%s : %0.3f seconds" % (self.name, end-self.start))
        return False

@cython.cdivision(True)
cdef inline long comb_count_safe(long n, long k):
    """ Calculate the number of combinations of n items in
    groups of k."""
    cdef long i, prod = 1
    for i in range(k):
        # The bracketing is super-important. Order of operations matters.
        prod = (prod * (n - i)) / (i + 1)
        #print(n-i, i+1, prod)
    return prod
            
cdef inline object comb_cy2(wtype n):
    cdef int i, j, row = 0
    cdef wtype[:,:] out = np.zeros((comb_count_safe(n,2), 2), dtype=wtype_s)
    for i in range(n):
        for j in range(i+1,n):
            #print(i,j)
            out[row,0] = i
            out[row,1] = j
            row += 1
    return np.array(out)
          
cdef make_lookup_array(wtype max_n):
    """ This produces a lookup array for easily generating
    combinations. Instead of calculating the combinations each
    time (for example with `comb_cy()`), instead we PRECALCULATE
    all of the indices for pair combinations of set of max_n
    elements.  Mostly in this notebook at the time of writing, 
    max_n is 40, i.e. 40 words in a context block."""
    cdef int i, j, k, n, row = 0
    cdef int maxcol = comb_count_safe(max_n, 2)
    cdef wtype[:,:,:] out1 = np.zeros((max_n+1, maxcol, 2), dtype=wtype_s)
    cdef wtype[:] out2 = np.zeros(max_n+1, dtype=wtype_s)
    cdef wtype[:,:] tmp
    for i in range(1, max_n + 1):
        n = comb_count_safe(i, 2)
        tmp = comb_cy2(i)
        for j in range(n):
            for k in range(2):
                out1[i, j, k] = tmp[j, k]
        out2[i] = n
    return np.array(out1), np.array(out2)

def method_sparse_cy3(
            int[:, :] context_blocks_array, 
            unsigned short[:] context_block_lengths,
            int max_words_per_block=40, 
            int max_section_length=int(1e7)):
    """
    Series of sparse matrix constructions.
    
    max_section_length is a setting.  Tweak to trade-off CPU vs RAM.
    """   
    x, xi = make_lookup_array(max_words_per_block)
    cdef wtype[:,:,:] lookup = x
    cdef wtype[:] lookup_i = xi
    # Max combinations possible in each block
    cdef int p = comb(max_words_per_block, 2)     
    
    # Buffers 
    cdef np.ndarray ones = np.ones(max_section_length, dtype=wtype_s)
    cdef wtype[:, :] co = np.zeros((max_section_length, 2), dtype=wtype_s)
    cdef wtype[:, :] new_entries = np.zeros((p, 2), dtype=wtype_s)
    cdef wtype[:, :] indices = np.zeros((p, 2), dtype=wtype_s)
    cdef long end = 0  # Keep track of position in the allocation array 
    
    # The max number of unique words.  Might need to go up.
    # Sets num rows and cols for the output sparse matrix
    cdef long ns = 2**16  # (65536) 
    # Output. Stores co-occurrence totals between word pairs.
    # The datatype determines the max count possible, and also the 
    # memory cost of the sparse matrix.  'u2' is quite aggressively
    # small. u4 shouldn't be much worse.
    m = csr_matrix((ns, ns), dtype=wtype_s)  # 

    cdef int i, j, k, cb, nn, num_words, n = context_blocks_array.shape[0]
    cdef wtype[:,:] pbuffer = np.zeros((p, 2), dtype=wtype_s)
    cdef bint full
    
    # Step over each context_block. Each contains a number of 
    # words between 5 and 40.
    for cb in range(context_blocks_array.shape[0]):
        # Find the number of words in this block
        num_words = context_block_lengths[cb]
        # Based on the NUMBER OF WORDS, use our lookup list to
        # get the pre-calculated array of indices for generating
        # the combinations. To visualize, we get back something like
        # this:
        # [0 1], [0 2], [0 3], [0 4], [1 2]
        # This is used to generate.
        ##indices = lookup[num_words]
        # Now that we have the indices, make the array of actual
        # co-occurrences.
        #with cython.boundscheck(False):
        for i in range(num_words):
            for j in range(i+1, num_words):
                co[end, 0] = context_blocks_array[cb, i]
                co[end, 1] = context_blocks_array[cb, j]
                end += 1
        # Buffer might be full
        full = end > max_section_length - p  # Account for next iteration fill-up, worst case
        if full:
            with benchmark('inner sparse matrix'):
                m += csr_matrix((ones[:end], (np.array(co[:end, 0]), np.array(co[:end, 1]))), (ns, ns))
            end = 0 # Reset back to start
    
    if end > 0:
        #m += csr_matrix((ones[:end], (co[:end, 0], co[:end, 1])), (ns, ns))
        with benchmark('inner sparse matrix'):
            m += csr_matrix((ones[:end], (np.array(co[:end, 0]), np.array(co[:end, 1]))), (ns, ns))
    return m    

In [144]:
try:
    del m
except:
    pass

print('Length of context_blocks_array:',len(context_blocks_array))
print('cba',context_blocks_array.shape, context_blocks_array.dtype)
print('cba_lengths', context_block_lengths.shape, context_block_lengths.dtype)
with Stats():
    #m = method_sparse_cy1(context_blocks_array, max_section_length=int(1e7))
    m = method_sparse_cy3(context_blocks_array, context_block_lengths)
    
print('Total co-occurences: {:,}'.format(m.sum()))
print('Shape of sparse matrix:', m.shape)
print('Size of sparse matrix: {:,.2f} MB'.format(m.data.nbytes/1024/1024))

Length of context_blocks_array: 100000
cba (100000, 100) int32
cba_lengths (100000,) uint16
inner sparse matrix : 0.528 seconds

Change in memory: -110.7 MB
Time cost (s)   : 0.5948 s

Total co-occurences: 7,551,388
Shape of sparse matrix: (65536, 65536)
Size of sparse matrix: 14.37 MB


### The large case:

In [153]:
try:
    del m
except:
    pass

print('Length of context_blocks_array:',len(new_cba))
with Stats():
    m = method_sparse_cy3(new_cba, new_cba_lengths, max_section_length=int(2e7))
    
print('Total co-occurences: {:,}'.format(m.sum()))
print('Size of sparse matrix: {:,.2f} MB'.format(m.data.nbytes/1024/1024))

Length of context_blocks_array: 200000
inner sparse matrix : 1.292 seconds
inner sparse matrix : 1.593 seconds
inner sparse matrix : 1.428 seconds

Change in memory: 213.4 MB
Time cost (s)   : 4.44 s

Total co-occurences: 56,443,920
Size of sparse matrix: 106.04 MB
