# Introduction to NLTK, NumPy, SciPy, and MatPlotLib

Practical course material for the ASDM Class 09 (Text Mining) by Florian Leitner.

© 2016 Florian Leitner. All rights reserved.

This lab covers working with the [Natural Language ToolKit](http://www.nltk.org/index.html) (NLTK), with [Numerical and Scientific Python, a.k.a. NumPy and SciPy](http://www.numpy.org/), and with [MatPlotLib](http://matplotlib.org/) (while your instructor ensures that all participants have a working notebook environment...). These are arguably the most fundamental libraries for Computational Linguistics and Scientific Computing in Python.

In [None]:
%pylab inline --no-import-all

If you have not done so already, read up on `%pylab` and, in particular, on what `--no-import-all` does.

In [None]:
%pylab?

## Natural Language ToolKit

Try to import the NLTK library; if that does not work, install it with `conda install nltk` if you are using Anaconda Python (or with `pip3 install nltk` if you are using regular Python).

In [None]:
import nltk
print(nltk.__version__) # should be 3.2.1 or newer

Fetch some data (text "corpora") for the NLTK; we will only use the `book` content.

In [None]:
nltk.download()

This should have opened another window where you can choose what corpora to download. It is sufficient for now to only download the `book` content. Note that downloading everything can take quite a while...

In [None]:
import nltk.book as b

### The Text Object

In [None]:
b.text1

In [None]:
type(b.text1)

Some quick info:

In [None]:
nltk.text.Text?

More in-depth documentation with all methods defined on `Text`:

In [None]:
help(nltk.text.Text)

In [None]:
# create a shortcut to Moby Dick
moby = b.text1
# iteration over all words in Moby Dick using a Python list comprehension
wlens = [len(w) for w in moby]
# show the first ten words
print(" ".join(moby[:10]))
# report the lengths of the first ten words
wlens[:10]

Count the occurrences of "me" in Moby Dick:

In [None]:
moby.count("me")

In [None]:
moby.findall?

Find all tokens before "Moby" in the text.

In [None]:
moby.findall("<.*><Moby>")

Now find all tokens occurring before "Dick" in the text.

In [None]:
moby.findall("<.*><Dick>")

Use regular expressions (here: only tokens with letters - roughly, "words") to match tokens.

In [None]:
moby.findall("<[A-Za-z]*><Moby>")

Use regular expression grouping `()` to extract only specific tokens that were matched.

In [None]:
moby.findall("(<[A-Za-z]*>)<Moby>")
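
To make the `<...>` token syntax less magical, here is a minimal pure-Python sketch of the idea: the tokens are joined into one bracketed string and an ordinary regular expression is run over it. The helper `token_findall` and the toy token list are hypothetical and only approximate what `Text.findall` does; this is not NLTK's actual implementation.

```python
import re

def token_findall(tokens, pattern):
    """Toy token-level regex search (illustrative; not NLTK's implementation)."""
    # Wrap every token in angle brackets: ["Moby", "Dick"] -> "<Moby><Dick>"
    raw = "".join("<{}>".format(t) for t in tokens)
    # Let an unescaped "." match anything except ">", so it cannot cross a token boundary
    regexp = re.sub(r"(?<!\\)\.", "[^>]", pattern)
    hits = re.findall(regexp, raw)
    # Strip the outer brackets and split each hit back into its tokens
    return [hit[1:-1].split("><") for hit in hits]

tokens = "I saw Moby Dick and then old Moby Dick again".split()
print(token_findall(tokens, "<.*><Moby>"))
print(token_findall(tokens, "(<[A-Za-z]*>)<Moby>"))
```

With a group `()` in the pattern, `re.findall` returns only the grouped part of each hit, which is why the second call yields just the token before "Moby".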

In [None]:
moby.similar?

In [None]:
moby.similar("boy") # NB: distributional similarity (words used in similar contexts; roughly, "synonyms")

In [None]:
len(b.text3) # total tokens

In [None]:
len(set(b.text3)) # unique tokens
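
The total and unique token counts are often combined into a single number, the type-token ratio, as a crude indicator of lexical diversity. A minimal sketch, assuming a plain list of tokens; the helper name `type_token_ratio` and the toy sentence are made up for illustration.

```python
def type_token_ratio(tokens):
    """Return unique tokens divided by total tokens (0.0 for an empty list)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

toy = "in the beginning God created the heaven and the earth".split()
print(len(toy), len(set(toy)), type_token_ratio(toy))
```

Since an NLTK `Text` supports `len()` and iteration, the same function should also accept `b.text3` directly. Keep in mind that the ratio shrinks as texts get longer, so it is only comparable between texts of similar length.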

In [None]:
wlens[:5]

In [None]:
sum(wlens) / len(wlens) # mean

In [None]:
sum(wlens) * 1.0 / len(wlens) # forces float division on Python 2; redundant on Python 3

Finding the median token length:

In [None]:
wlens = sorted(wlens)

In [None]:
wlens[len(wlens) // 2] # integer division: a list index must be an int
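
Indexing a sorted list at its midpoint picks the upper of the two middle values when the list has an even length. The following is a minimal sketch of a median that averages the two middle values instead, checked against `statistics.median` from the standard library; the helper name `median_of` and the toy data are assumptions for illustration.

```python
import statistics

def median_of(values):
    """Median of a non-empty sequence of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2          # integer division: a list index must be an int
    if len(ordered) % 2 == 1:
        return ordered[mid]          # odd length: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even length: mean of the two middle values

lengths = [4, 2, 7, 1, 3, 3]
print(median_of(lengths), statistics.median(lengths))
```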

The longest token in the book:

In [None]:
wlens[-1]

### Plots available directly from a Text Object

The text object and the tools that follow are an extremely important part of the toolkit of any text miner or computational linguist. They allow us to visually inspect the properties of our text (collection/corpus).

In [None]:
b.text2

In [None]:
b.text3

In [None]:
b.text3.plot?

In [None]:
nltk.probability.FreqDist.plot?

Show the most frequent tokens in a text; note the Zipfian distributions!

In [None]:
plt.figure(figsize=(20,10))
print(len(b.text1))
b.text1.plot(100)

In [None]:
plt.figure(figsize=(20,10))
print(len(b.text2))
b.text2.plot(100)

In [None]:
plt.figure(figsize=(20,10))
print(len(b.text3))
b.text3.plot(100)

Note that the larger the "bulge" on the left and the wider the tail to the right, the larger the semantic richness of the text. Which of the books is semantically richer, then? Would you agree it's book 2, "Sense and Sensibility by Jane Austen 1811"?
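
To look at the numbers behind such a plot, one can tabulate rank against frequency. Under Zipf's law, frequency is roughly inversely proportional to rank, so the product of rank and count should stay in the same ballpark for a large text. A minimal sketch using `collections.Counter`; the helper `rank_frequency` and the toy sentence are hypothetical, and the toy text is far too small to actually exhibit the law.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, token, count) triples, most frequent token first."""
    ranked = Counter(tokens).most_common()
    return [(rank, tok, cnt) for rank, (tok, cnt) in enumerate(ranked, start=1)]

toy = "the cat and the dog and the bird".split()
for rank, tok, cnt in rank_frequency(toy):
    print(rank, tok, cnt, rank * cnt)
```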

In [None]:
b.text2.collocations?

**Collocations** are pairs of words that occur together more frequently than would be expected by chance. Therefore, we can assume the pairs have semantics that go beyond the semantics of each individual word.

In [None]:
b.text2.collocations()
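
"More frequently than expected by chance" can be made concrete with an association score. The sketch below uses pointwise mutual information (PMI), chosen here only because it is simple to write down; NLTK's own scoring for `collocations()` may differ. The helper `bigram_pmi`, its `min_count` parameter, and the toy sentence are assumptions for illustration.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by PMI = log2(P(x, y) / (P(x) * P(y)))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs get unreliably high PMI, so skip them
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores

toy = "new york is big and new york is old and the sea is big".split()
scores = bigram_pmi(toy)
print(sorted(scores, key=scores.get, reverse=True))
```

A positive score means the pair co-occurs more often than its individual word frequencies would predict.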

In [None]:
b.text2.dispersion_plot?

In a nutshell, plot the occurrence of each word in the text. Let's look at the most frequent proper nouns (names) that we found above in the frequency plot.

In [None]:
b.text2.dispersion_plot(["Elinor", "Marianne", "Edward", "Dashwood", "Willoughby", "Lucy", "Brandon"])

So this even allows us to quickly get a "feel" for the story itself; e.g., here you can easily tell who interacts with whom (and who does not)!
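
Conceptually, a dispersion plot draws one mark per occurrence of a word at that token's offset in the text. A minimal sketch of collecting those offsets from a plain token list; the helper `token_offsets` and the toy sentence are hypothetical and not part of NLTK.

```python
def token_offsets(tokens, targets):
    """Map each target word to the list of positions where it occurs."""
    offsets = {word: [] for word in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets

toy = "Elinor met Edward ; later Marianne met Willoughby ; Elinor worried".split()
print(token_offsets(toy, ["Elinor", "Marianne", "Edward"]))
```

Words whose offsets cluster in the same regions of the text tend to appear in the same scenes.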

### NLTK's Discrete "Frequency" Distributions

A counter for tokens.

In [None]:
from nltk import FreqDist

In [None]:
fdist = FreqDist(moby) # count every token in Moby Dick (the "moby" shortcut defined above)

In [None]:
list(fdist.items())[:10]

In [None]:
fdist.max() # the token with the highest count

In [None]:
fdist["the"] # the count for the token "the" 

In [None]:
fdist.freq("the") # frequency/total_frequency => the probability of the token "the"
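
In NLTK 3, `FreqDist` builds on Python's `collections.Counter`, so the same three quantities can be sketched with the standard library alone. The toy token list below is made up for illustration.

```python
from collections import Counter

tokens = "the whale and the sea and the ship".split()
counts = Counter(tokens)

most_frequent = counts.most_common(1)[0][0]       # analogous to fdist.max()
count_the = counts["the"]                         # analogous to fdist["the"]
freq_the = counts["the"] / sum(counts.values())   # analogous to fdist.freq("the")
print(most_frequent, count_the, freq_the)
```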

In [None]:
fdist.plot(50) # plot the counts of the 50 most frequent tokens

## NumPy and SciPy: Numeric Python

A good NumPy tutorial: [www.engr.ucsb.edu/~shell/che210d/numpy.pdf](http://www.engr.ucsb.edu/~shell/che210d/numpy.pdf)

The API/reference for NumPy and SciPy: [docs.scipy.org/doc/](http://docs.scipy.org/doc/)

In [None]:
# already done for us by the %pylab setup at the beginning:
# import numpy as np

### NumPy Arrays as Vectors

In [None]:
a = np.arange(10)
a

In [None]:
a.shape

In [None]:
b = np.arange(10, 20)
b

In [None]:
a * b

In [None]:
a < b # NumPy arrays generally behave similarly to R vectors

In [None]:
c = np.concatenate((a, b)) # NB that the argument to np.concatenate() is a SINGLE sequence (here: a tuple) of arrays
c

In [None]:
a[a > 5]

In [None]:
a.dot(b)

In [None]:
np.where(a > 5, "greater", "lesser")
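
The vector operations above can be summarized in one self-contained sketch: arithmetic and comparisons work element by element, a boolean array can be used as a mask, and the dot product is the sum of the element-wise products. The variable names `elementwise`, `mask`, and `labels` are introduced here for illustration only.

```python
import numpy as np

a = np.arange(10)        # 0..9
b = np.arange(10, 20)    # 10..19

elementwise = a * b              # multiplies position by position
dot = a.dot(b)                   # sum of the element-wise products
mask = a > 5                     # boolean array, one entry per element
labels = np.where(mask, "greater", "lesser")  # pick a label per element

print(dot, elementwise.sum(), a[mask], labels[:3])
```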

In [None]:
a.view?

In [None]:
if 5 in a:
    print("hurray")

In [None]:
5.0 in a # True: "a" is an array of integers and "5.0" is a float, but membership is tested with ==

### Array Statistics

Let's convert our `wlens` list from the NLTK part to a NumPy array:

In [None]:
wlarr = np.array(wlens)

In [None]:
wlarr.sum()

In [None]:
wlarr.mean()

In [None]:
wlarr.var()

In [None]:
wlarr.std()

In [None]:
wlarr.min()

In [None]:
wlarr.max()

In [None]:
wlarr.argmax?

In [None]:
wlarr[wlarr.argmax()]

In [None]:
wlarr.sort()

In [None]:
wlarr[len(wlarr) // 2] # integer division: an array index must be an int

In [None]:
np.median(wlarr) # NB: median is not a method defined on array!
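
Two details are easy to miss here: the `sort()` method sorts the array in place and returns `None`, whereas `np.sort()` returns a sorted copy; and indexing the sorted array at its midpoint differs from `np.median()` when the array has an even length. A small sketch with made-up data:

```python
import numpy as np

lengths = np.array([4, 2, 7, 1, 3, 5])

sorted_copy = np.sort(lengths)     # returns a new, sorted array
result = lengths.copy().sort()     # the method sorts in place and returns None

upper_middle = sorted_copy[len(sorted_copy) // 2]  # upper of the two middle values
print(sorted_copy, result, upper_middle, np.median(lengths))
```

For an even number of elements, `np.median()` averages the two middle values, so it reports 3.5 here while the midpoint index yields 4.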

### Arrays as Matrices

In [None]:
a = np.array([[0, 1], [2, 3]], float)
b = np.array([2, 3], float)
c = np.array([[1, 1], [4, 0]], float)
a

In [None]:
a.shape

In [None]:
b * a

In [None]:
b.dot(a)

In [None]:
a.dot(c)
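
The difference between `*` and `dot()` deserves a closer look: `*` multiplies element by element (broadcasting the vector across the rows of the matrix), while `dot()` computes the vector-matrix or matrix-matrix product. A self-contained sketch using the same three arrays:

```python
import numpy as np

a = np.array([[0, 1], [2, 3]], float)
b = np.array([2, 3], float)
c = np.array([[1, 1], [4, 0]], float)

scaled = b * a        # broadcasting: each row of a is multiplied element-wise by b
vec_mat = b.dot(a)    # vector-matrix product
mat_mat = a.dot(c)    # matrix-matrix product (same as a @ c on Python 3.5+)

print(scaled)
print(vec_mat)
print(mat_mat)
```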

In [None]:
np.cross(a, b)

In [None]:
np.inner(a, b)

In [None]:
np.outer(a, b)

### Linear Algebra Functions

In [None]:
np.linalg.det(a)

In [None]:
np.linalg.eig(a)

In [None]:
np.linalg.inv(a)

In [None]:
np.linalg.svd(a)
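
A quick way to build intuition for these functions is to verify the identities they are supposed to satisfy. The sketch below checks the inverse, the eigendecomposition, and the SVD of the same 2x2 matrix; the variable names are chosen here for illustration.

```python
import numpy as np

a = np.array([[0, 1], [2, 3]], float)

det = np.linalg.det(a)                 # 0*3 - 1*2 = -2
inv = np.linalg.inv(a)
eigvals, eigvecs = np.linalg.eig(a)    # the columns of eigvecs are the eigenvectors
u, s, vt = np.linalg.svd(a)            # a equals u . diag(s) . vt

print(np.allclose(a.dot(inv), np.eye(2)))              # A A^-1 = I
print(np.allclose(a.dot(eigvecs), eigvecs * eigvals))  # A v = lambda v, per column
print(np.allclose(u.dot(np.diag(s)).dot(vt), a))       # SVD reconstructs A
```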

### Advanced Linear Algebra

In [None]:
from scipy import linalg as la

In [None]:
la.svd(a)

In [None]:
%timeit np.linalg.svd(a)

In [None]:
%timeit la.svd(a)

Other important functionality provided by SciPy modules:

* `constants` - a huge list of math/phys constants
* `integrate` - numerical integration/ODEs
* `optimize` - optimization functions (e.g., BFGS is here)
* `sparse` - working with large, sparse matrices
* `interpolate` - interpolation for discrete data (linear and spline)
* `fftpack` - Fast Fourier transform routines
* `signal` - Signal processing (time-series analysis)
* **`stats`** - a huge collection of distributions and functions

### Statistical Functions

Three variables, four instances:

In [None]:
data = np.array([[1, 2, 1, 3], [5, 3, 1, 8], [2, 1, 5, 6]], float)

In [None]:
np.corrcoef(data)

In [None]:
np.cov(data)
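
Both functions treat each row as a variable and each column as an observation, so three variables give 3x3 matrices. The correlation matrix is just the covariance matrix normalized by the standard deviations, which the following sketch verifies on the same data (the names `std` and `corr` are introduced here for illustration):

```python
import numpy as np

data = np.array([[1, 2, 1, 3], [5, 3, 1, 8], [2, 1, 5, 6]], float)

cov = np.cov(data)                   # 3x3: one row/column per variable
std = np.sqrt(np.diag(cov))          # per-variable standard deviations
corr = cov / np.outer(std, std)      # normalize covariances to correlations

print(cov.shape, np.allclose(corr, np.corrcoef(data)))
# np.cov divides by N - 1 by default, so its diagonal matches var(ddof=1)
print(np.allclose(np.diag(cov), data.var(axis=1, ddof=1)))
```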

In [None]:
from scipy import stats # by the way, "RV" means "random variable"

In [None]:
stats.norm.pdf?

In [None]:
d = stats.norm.pdf([-1.0, 0.0, 1.0])
d

In [None]:
d.min()

In [None]:
d.max()

In [None]:
d.mean()
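
With its default parameters (`loc=0`, `scale=1`), `stats.norm.pdf` evaluates the standard normal density $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}$. To see what is being computed, here is a NumPy-only sketch of that formula; the helper name `standard_normal_pdf` is made up for illustration and is not a SciPy function.

```python
import numpy as np

def standard_normal_pdf(x):
    """Density of the normal distribution with mean 0 and standard deviation 1."""
    x = np.asarray(x, dtype=float)
    return np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

d = standard_normal_pdf([-1.0, 0.0, 1.0])
print(d, d.max(), d.min())
```

The density peaks at the mean and is symmetric around it, so the values at -1.0 and 1.0 coincide.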

For more info on working with distributions, have a look at the SciPy stats tutorial at [docs.scipy.org/doc/scipy/reference/tutorial/stats.html](http://docs.scipy.org/doc/scipy/reference/tutorial/stats.html)

## Simple Bar Charts with MatPlotLib

`matplotlib` is slightly over-engineered, particularly if you are used to R's `ggplot2`: just plotting a bar chart for a dictionary of word counts can be a challenge for beginners. Here's how...

In [None]:
d = {'word': 3, 'other': 7, 'another': 2, 'hello': 5, 'plot': 4}

In [None]:
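fig = plt.figure()
ax = fig.add_subplot(111)
rng = np.arange(len(d))
words, counts = zip(*d.items())
# center each bar on its tick position so that the labels line up with the bars
ax.bar(rng, counts, align='center')
ax.set_xticks(rng)
ax.set_xticklabels(words)
fig.autofmt_xdate() # rotates the tick labels to avoid overlaps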

Let's implement this as a function:

In [None]:
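def wordchart(word_counts):
    """Plot a bar chart for a dictionary that maps words to their counts."""
    fig = plt.figure()
    ax = fig.add_subplot(111)
    rng = np.arange(len(word_counts))
    words, counts = zip(*word_counts.items())
    # center each bar on its tick position so that the labels line up with the bars
    ax.bar(rng, counts, align='center')
    ax.set_xticks(rng)
    ax.set_xticklabels(words)
    fig.autofmt_xdate() # rotates the tick labels to avoid overlaps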

In [None]:
wordchart(d)