In [1]:
import numpy

In [2]:
import scipy

In [3]:
import gensim
# `gensim.utils.lemmatize` was removed in gensim 4.0, so we check the install with `simple_preprocess` instead
gensim.utils.simple_preprocess("The quick brown fox jumps over the lazy dog!")

In [None]:
import nltk
nltk.download('brown')

import textblob
textblob.TextBlob("The quick brown fox jumps over the lazy dog!").noun_phrases

If the above executes without errors, you'll see a number appear to the left of each of these cell prompts, and you're good to go!

In case you're using [virtual environments](http://docs.python-guide.org/en/latest/dev/virtualenvs/) (recommended), check that the right package/location was picked up by Python:

In [None]:
print(scipy.__version__, scipy.__file__)
print(gensim.__version__, gensim.__file__)
scipy.show_config()

## Check training data

Make sure you have downloaded all necessary data files (again, see the [README](https://github.com/piskvorky/topic_modeling_tutorial/)):

In [None]:
!ls -lh ./data/

You should see at least two entries there: `simplewiki-20140623-pages-articles.xml.bz2` and `20news-bydate.tar.gz`.

## Quick Python recap

#### Data streaming, generators, iterators

Generators are a built-in way to iterate over a sequence **once**, without materializing all its elements at the same time:

In [None]:
def odd_numbers():
    """
    Yield one odd number after another.
    
    Don't try to materialize its result in a plain list with `list(odd_numbers())`,
    because the sequence is infinite and you'll run out of RAM!
    """
    result = 1
    while True:
        yield result  # `yield` instead of `return`!
        result += 2

odd_numbers_generator = odd_numbers()

for odd_number in odd_numbers_generator:
    print(odd_number)
    if odd_number > 10:
        break

We'll be using this pattern of "generate a data point, process it, forget it" often, because it allows us to bypass RAM limitations. With generators we can process huge text corpora in constant memory, using clever algorithms that don't mind operating one-data-point-at-a-time.

This is in contrast to plain Python lists, pandas DataFrames, or even NumPy and SciPy arrays, where the entire sequence must be known beforehand and held fully in (virtual) memory.
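As a rough illustration of a "one-data-point-at-a-time" algorithm, the sketch below computes a mean over a stream without ever storing the stream. The helper name `running_mean` is made up for this example and is not part of any library used in this tutorial.

```python
def running_mean(stream):
    """Return the mean of `stream`, seeing each value exactly once (0.0 if empty)."""
    count, mean = 0, 0.0
    for value in stream:
        count += 1
        # incremental update: only `count` and `mean` are kept, never the values
        mean += (value - mean) / count
    return mean

# a generator expression: the first 1000 odd numbers, produced lazily
first_odds = (2 * i + 1 for i in range(1000))

mean = running_mean(first_odds)
print(mean)
```

Memory use stays constant no matter how long the stream is, because each value is forgotten as soon as it has been folded into the running mean.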

Generators and iterators come at a cost: since we're only allowed to go one item after another, it's not possible to skip to the middle of the sequence. Unless we take care of it manually, there's no equivalent of randomly accessing an arbitrary element à la `list`s: `some_list[100]` will work, but `some_generator[100]` won't.
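A minimal sketch of what "taking care of it manually" can look like: indexing a generator fails, but `itertools.islice` from the standard library can consume and discard items until the requested position is reached. The generator is redefined here so the snippet runs on its own.

```python
import itertools

def odd_numbers():
    """Yield odd numbers 1, 3, 5, ... forever."""
    result = 1
    while True:
        yield result
        result += 2

stream = odd_numbers()

# indexing a generator is not supported and raises TypeError
try:
    stream[100]
except TypeError as err:
    print("cannot index a generator:", err)

# manual "random access": skip the first 100 items, then take the next one
item_100 = next(itertools.islice(stream, 100, None))
print(item_100)  # 201, the item a list would hold at index 100
```

Note that the skipped items are still generated one by one, and the stream cannot be rewound afterwards.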

An iterable is like a generator (memory efficient), except it can be iterated over **multiple times**. To achieve that, we override the object's special `__iter__` method (which Python calls every time we loop over the object) to return a generator:

In [None]:
class OddNumbers(object):
    def __iter__(self):
        result = 1
        while True:
            yield result
            result += 2

odd_numbers_iterable = OddNumbers()

for odd_number in odd_numbers_iterable:
    print(odd_number)
    if odd_number > 10:
        break
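To make the difference visible, here is a small sketch with finite sequences, so that both can safely be turned into lists. The names `first_odd_numbers` and `FirstOddNumbers` are invented for this illustration.

```python
def first_odd_numbers(n):
    """Generator function: yield the first `n` odd numbers."""
    for i in range(n):
        yield 2 * i + 1

class FirstOddNumbers(object):
    """Iterable: every call to `__iter__` starts a fresh generator."""
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        for i in range(self.n):
            yield 2 * i + 1

generator = first_odd_numbers(3)
first_pass = list(generator)
second_pass = list(generator)  # the generator is already exhausted
print(first_pass, second_pass)  # [1, 3, 5] []

iterable = FirstOddNumbers(3)
print(list(iterable), list(iterable))  # [1, 3, 5] [1, 3, 5]
```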

That's all we need to know for our purposes. For more info, read [Data streaming in Python: generators, iterators, iterables](http://radimrehurek.com/2014/03/data-streaming-in-python-generators-iterators-iterables/), or [Python's documentation for "iterator types"](https://docs.python.org/3/library/stdtypes.html#iterator-types).

#### NumPy & SciPy arrays

NumPy is a third-party package (not built-in). **NumPy arrays are a concise and efficient way to represent a fixed-length list of numbers** (or of arbitrary objects, though that is not relevant for this tutorial). Their power comes from pithy array slicing, even in multiple dimensions:

In [None]:
# create a 2D table of random numbers, with 10 rows and 5 columns
x = numpy.random.rand(10, 5)

print(x)

In [None]:
# print element in 3rd row and 2nd column
print(x[2, 1])  

In [None]:
# print the entire 3rd row
print(x[2])

In [None]:
# print the entire 2nd column
print(x[:, 1])

In [None]:
# print a sub-table (rectangular region), starting at [0, 0] and ending at [4, 2] (exclusive)
print(x[:4, :2])

NumPy's power also comes from the fact that the underlying implementation is written to be fast (in C, plugging into fast BLAS routines where available).

Similarly, the **third-party SciPy package contains `scipy.sparse` arrays**, which are a way to represent vectors and matrices with assumed (implicit) zeros.

`scipy.sparse` arrays are not as efficient as NumPy arrays, because they don't plug into BLAS and because their memory access patterns are more involved (cache misses). But not materializing the zeros explicitly can make a huge difference for very sparse arrays (lots of zeros). However, all non-zero values must still reside in memory, so ultimately, for large data, we still resort to generators and data streaming.
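A small sketch of the memory trade-off, assuming the CSR (compressed sparse row) format, which is only one of several formats offered by `scipy.sparse`. The matrix size and the three non-zero values are arbitrary choices for illustration.

```python
import numpy
import scipy.sparse

# a 1000 x 1000 dense table that is almost entirely zeros
dense = numpy.zeros((1000, 1000))
dense[0, 1] = 1.0
dense[5, 7] = 2.5
dense[999, 0] = -3.0

# the sparse version stores only the non-zero values plus their positions
sparse = scipy.sparse.csr_matrix(dense)
print(sparse.nnz)  # number of explicitly stored values: 3

dense_bytes = dense.nbytes
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes
print(dense_bytes, sparse_bytes)

# both representations give the same matrix-vector product
v = numpy.ones(1000)
same = numpy.allclose(sparse.dot(v), dense.dot(v))
print(same)
```

The exact byte counts depend on the index dtype SciPy picks, but for a matrix this sparse the CSR arrays are far smaller than the dense table.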

A common pattern that we'll be using is **combining the efficiency of in-memory arrays** (numpy, scipy.sparse) with the **scalability of data streaming**. Instead of processing one document at a time (slow), or all documents at once (non-scalable), we'll be reading **a chunk of documents** into RAM (= as many documents as RAM allows), processing this chunk, then throwing it away and streaming a new chunk into RAM.
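The chunking pattern can be sketched roughly as follows. The `chunks` helper is a hypothetical name written for this example (it is not a gensim or itertools function), and a short stream of odd numbers stands in for a large corpus of documents.

```python
import itertools
import numpy

def odd_numbers():
    """Yield odd numbers 1, 3, 5, ... forever."""
    result = 1
    while True:
        yield result
        result += 2

def chunks(stream, chunksize):
    """Yield successive lists of up to `chunksize` items from `stream`."""
    iterator = iter(stream)
    while True:
        chunk = list(itertools.islice(iterator, chunksize))
        if not chunk:
            return  # the stream is exhausted
        yield chunk

# a finite stream of the first 10 odd numbers stands in for a large corpus
stream = itertools.islice(odd_numbers(), 10)

total = 0
for chunk in chunks(stream, chunksize=4):
    array = numpy.array(chunk)  # only this chunk is held in memory as an array
    total += array.sum()        # fast, vectorized processing of the whole chunk
    print(array)

print(total)  # 100, the sum of the first 10 odd numbers
```

In practice `chunksize` would be tuned so that one chunk fits comfortably in RAM; the last chunk may be shorter than the rest.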

#### Itertools

A [built-in Python library](https://docs.python.org/3/library/itertools.html) for working efficiently with data streams (iterables, iterators, generators):

In [None]:
import itertools

infinite_stream = OddNumbers()

# compute the first 10 items (and no more) & print them
print(list(itertools.islice(infinite_stream, 10)))

# lazily concatenate streams; the result is also infinite
concat_stream = itertools.chain('abcde', infinite_stream)
print(list(itertools.islice(concat_stream, 10)))

numbered_stream = enumerate(infinite_stream)  # also infinite
print(list(itertools.islice(numbered_stream, 10)))

# etc; see the itertools docs for more examples

The examples above show another useful pattern: take a small sample of the stream (e.g. the first ten elements) and convert it into a plain Python list, with `list(islice(stream, 10))`. To convert an entire stream into a list, simply call `list(stream)` (watch out for RAM here though, and never do this with an infinite stream!). Nothing beats the simplicity of `list(stream)` for debugging purposes.

## Notebooks

At any point, you can save a notebook to disk by pressing `CTRL`+`s` (or `CMD`+`s`). This will **save all changes you've made to the notebook**, including cell outputs, locally to your disk.

To discard your notebook changes, simply check out the notebook file again from git (or extract it again from the repository ZIP archive). This will reset the notebook to its original state, **losing all your changes**.