Ok, but part-of-speech tagging is both slow (NLTK's default tagger runs a trained averaged perceptron over every single token!) and memory intensive. Very memory intensive.

In [None]:
# with open("../data/signpost_corpus.txt", "r") as f:
#     signpost_tagged_pos = nltk.pos_tag(nltk.word_tokenize(f.read())[:10000])

In [None]:
# del signpost_tagged_pos

Task Manager shows that this operation took `0.28 GB` of memory while processing just 1/445th of the total data. Assuming memory use scales linearly, processing the entire corpus would take `~124 GB`, which is obviously absurd.
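
For the record, the extrapolation is nothing more than a linear scale-up of the sample measurement. A quick check, assuming memory really does grow in proportion to the number of tokens (the variable names are mine):

```python
sample_gb = 0.28     # memory observed for the 10,000-token sample
scale_factor = 445   # the sample is 1/445th of the corpus

# Linear extrapolation to the full corpus; real usage may well differ.
estimated_total_gb = sample_gb * scale_factor
print(f"{estimated_total_gb:.1f} GB")
```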

Ok, so don't do it directly. Got it.

Reading the tokens into a `numpy` array didn't work either, unfortunately: it raised a memory error. An `np.array(tokens[:1000000])` object (representing <25% of the data) takes up `~2 GB` of memory. I'm very clearly reaching the limits of my hardware.

In [2]:
import nltk

with open("../data/signpost_corpus.txt", "r", encoding="utf8") as f:
    sentences = nltk.sent_tokenize(f.read())

In [3]:
sentences[:5]

["The Association of Members' Advocates, an independent association involved in the dispute resolution process, has started discussions on having a new election for the position of AMA Coordinator.",
 'Since the initial election of Alex756 to this position last April was for a term of six months, it appears the new election is already several months overdue.',
 'A poll on the question of having new elections was running strongly in favor, but general discussion among the members was a little more uncertain how to proceed.',
 'Voicing an outside opinion on the AMA, Ambi said she thought it was "completely dead" and hadn\'t seen it in action anywhere.',
 "Some advocates responded that their work hadn't been done in the later stages of the dispute resolution system, noting that they seemed mostly to be helping newcomers who weren't familiar with Wikipedia procedures."]

In [4]:
len(sentences)

172472

In [5]:
.2*172472/10000

3.44944

The above is a hopeful estimate, in GB, of how much memory the tagged sentences would need, at roughly `0.2 GB` per 10,000 sentences (based on some now-deleted tests). I only have `8 GB` of RAM available on my desktop at the moment. I will buy more, since I've clearly hit the limits of my hardware. For now, we'll just take the most recent 100,000 sentences published in the *Signpost* instead.

In [11]:
import numpy as np

sentences_pos = np.concatenate([nltk.pos_tag(nltk.word_tokenize(sentence)) for sentence in sentences[-100000:]])

The bottleneck isn't the classifier itself, which barely uses any memory; it's writing the results into a `numpy` array afterwards and retaining it. The write above took `~4 GB` to execute, spiking me to my RAM limit, before falling back down to "just" `~3.6 GB`, below.

In [14]:
sentences_pos

array([['The', 'DT'],
       ['Malagasy', 'NNP'],
       ['Wikipedia', 'NNP'],
       ..., 
       ['his', 'PRP$'],
       ['retirement', 'NN'],
       ['.', '.']], 
      dtype='<U171')

In [17]:
sentences_pos.nbytes

3632656968

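The `dtype` explains the huge memory drain: `<U171` is a fixed-width Unicode type, so `numpy` reserves room for 171 characters (4 bytes each) in every element, however short the string actually is. One extremely long token inflates the whole array. I should have seen this coming!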
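
To see the effect in isolation, here is a small sketch on toy data (not the corpus). It shows that a single long string sets the element size for the entire array, and that the same arithmetic accounts for the `nbytes` figure reported above. The toy array and its 171-character filler word are made up for illustration.

```python
import numpy as np

# One 171-character "word" forces every element to be 171 characters wide.
toy = np.array([["The", "DT"], ["x" * 171, "JJ"]])
print(toy.dtype)
print(toy.itemsize)   # 684 bytes per element (171 characters * 4 bytes)

# The same arithmetic recovers the row count of the real array from its
# reported size: 2 elements per row, 171 * 4 bytes per element.
bytes_per_row = 2 * 171 * 4
n_rows = 3632656968 // bytes_per_row
print(n_rows)
```

So the array holds about 2.66 million (token, tag) rows, each costing 1,368 bytes regardless of how short the token is.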

In [21]:
sentences_pos[np.argmax([len(word) for word, _ in sentences_pos])]

array([ '//www.lsi.upc.edu/~tsteiner/papers/2013/mj-no-more-using-concurrent-wikipedia-edit-spikes-with-social-network-plausibility-checks-for-breaking-news-detection-ramss2013.pdf',
       'JJ'], 
      dtype='<U171')

Well:

> The longest word in the Oxford English Dictionary is ... 'Supercalifragilisticexpialidocious', made famous by Mary Poppins, [which] is 34 letters long.

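So let's cast to `dtype="<U34"`. Anything longer than 34 characters gets truncated, but those tokens are mostly URLs like the one above.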

In [27]:
# Cast the whole array in one call; strings longer than 34 characters are truncated.
sentences_pos.astype("<U34")

array([['The', 'DT'],
       ['Malagasy', 'NNP'],
       ['Wikipedia', 'NNP'],
       ..., 
       ['his', 'PRP$'],
       ['retirement', 'NN'],
       ['.', '.']], 
      dtype='<U34')

This works. Let's use this as a workaround from the start.
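
A minimal sketch of what "from the start" could look like: pass the `dtype` when the array is first built, so the wide intermediate array never exists. The `tagged` list below is a stand-in for the output of `nltk.pos_tag`, and its 200-character token is made up for illustration.

```python
import numpy as np

# Stand-in for nltk.pos_tag output: a list of (token, tag) pairs.
tagged = [("The", "DT"), ("Malagasy", "NNP"), ("/" * 200, "JJ")]

# Fixing the dtype up front caps every element at 34 characters.
capped = np.array(tagged, dtype="<U34")

print(capped.dtype.itemsize)   # 136 bytes per element (34 characters * 4 bytes)
print(len(capped[2, 0]))       # the 200-character token is cut down to 34
```

The truncation is silent, so this only makes sense when the over-long tokens are disposable. At `<U34`, an array with as many rows as the `<U171` one above would need 34/171 of its 3,632,656,968 bytes, or roughly `0.72 GB`.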