# Exploratory Data Analysis - Zipf's Law

* Does Zipf's law hold?
* Analyze word frequencies before and after removing stop words
* Analyze word frequencies before and after stemming/lemmatization

In [None]:
%load_ext autoreload
%autoreload 2
%aimport haikulib.data_utils

%config InlineBackend.figure_format = 'svg'
%matplotlib inline

from collections import Counter
import operator
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import nltk

import spacy

# The "en" shortcut was removed in spaCy 3; load the model by its package name.
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

sns.set()

## Zipf's Law

Simply put, Zipf's law states that the frequency of a word in a natural language corpus is inversely proportional to its rank in a frequency table. Consequently, a plot of rank vs. frequency on a log-log scale is roughly linear, with a slope near -1.
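
Written out, with $f(r)$ the frequency of the word at rank $r$, the relationship is roughly the following. The constant $C$ and the exponent $s$ are symbols introduced here for illustration only; they do not appear elsewhere in this notebook.

$$
f(r) \approx \frac{C}{r^{s}}, \qquad s \approx 1
\quad\Longrightarrow\quad
\log f(r) \approx \log C - s \log r
$$

Taking logarithms turns the power law into a straight line with slope $-s$, which is why a log-log plot is a natural diagnostic.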

For example, the first word in the table below is roughly twice as frequent as the second word, and three times as frequent as the third.

| rank | value  | occurrences |
|------|--------|-------------|
| 1    | word 1 | 21          |
| 2    | word 2 | 10          |
| 3    | word 3 | 7           |

A plot of this frequency table on a log-log scale is shown below.

In [None]:
ranks = np.array([1, 2, 3])
frequencies = np.array([21, 10, 7])

plt.plot(np.log(ranks), np.log(frequencies))
plt.plot(np.log(ranks), np.log(frequencies), ".")

plt.title("Example of Zipf's Law")
plt.xlabel(r"$\log(rank)$")
plt.ylabel(r"$\log(freq)$")
plt.show()
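
To put a number on "roughly linear", one option is to fit a least-squares line to the log-log points and read off the slope. The sketch below does this for the toy table above using only the standard library. `loglog_slope` is a hypothetical helper written for this illustration; it is not part of `haikulib`.

```python
import math


def loglog_slope(ranks, frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(f) for f in frequencies]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    return covariance / variance


# Toy table from above: ranks 1-3 with 21, 10, and 7 occurrences.
toy_slope = loglog_slope([1, 2, 3], [21, 10, 7])
print(f"fitted slope: {toy_slope:.2f}")
```

A slope close to -1 is what Zipf's law predicts, and the toy table lands very near that value.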

## Zipf's Law for our Dataset

We first build a bag-of-words representation of the dataset, then construct the frequency table from it.

In [None]:
def get_freq_table(bag, thing="word"):
    """Get a frequency table representation of the given bag-of-<thing> representation."""
    assert isinstance(bag, Counter)
    things, frequencies = zip(*sorted(bag.items(), key=operator.itemgetter(1), reverse=True))
    things = np.array(things)
    frequencies = np.array(frequencies)
    ranks = np.arange(1, len(things) + 1)

    freq_table = pd.DataFrame({"rank": ranks, thing: things, "frequency": frequencies})
#     freq_table.set_index("rank", inplace=True, drop=False)
    return freq_table

In [None]:
bag = haikulib.data_utils.get_bag_of(column="haiku", kind="words")
freq_table = get_freq_table(bag)
freq_table.head()
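
To make the shape of the frequency table concrete, here is a dependency-free sketch on a made-up bag. The sample text and the `toy_` names are illustrative assumptions; the real table is the `DataFrame` returned by `get_freq_table`.

```python
from collections import Counter

# Hypothetical sample text standing in for the haiku corpus.
toy_text = "the old pond the frog jumps in the sound of water the pond"
toy_bag = Counter(toy_text.split())

# most_common() sorts by descending count; enumerate assigns ranks from 1.
toy_table = [
    {"rank": rank, "word": word, "frequency": count}
    for rank, (word, count) in enumerate(toy_bag.most_common(), start=1)
]

for row in toy_table[:3]:
    print(row)
```

Words with equal counts receive consecutive ranks in whatever order the sort leaves them, which is also how `get_freq_table` behaves.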

Plotting each word's rank against its frequency on a log-log scale suggests that Zipf's law holds for most of the dataset.

In [None]:
plt.plot(np.log(freq_table["rank"]), np.log(freq_table["frequency"]), '.', markersize=3)

plt.title("Haiku Word Frequency")
plt.xlabel(r"$\log(rank)$")
plt.ylabel(r"$\log(freq)$")
plt.show()

Next, we look up the words and their corresponding frequencies at the interesting breaks in the plot.

In [None]:
def get_indices(df, column, values):
    """Gets the indices of values from the given column of the given dataframe."""
    indices = []
    for value in values:
        indices += df[column][df[column] == value].index.tolist()
    return indices

In [None]:
indices = get_indices(freq_table, "word", ["the", "a", "of", "to", "i", "her", "his"])
interesting = freq_table.loc[indices]
interesting

In [None]:
plt.plot(
    np.log(freq_table["rank"]), np.log(freq_table["frequency"]), ".", markersize=3
)

# This should be a crime.
x_adjust = np.array([0.1, -0.6, 0.15, -0.6, 0.2, -0.6, 0.0])
y_adjust = np.array([1.0, -1.2, 1.0, -1.3, 1.0, -1.3, 1.0])

for word, freq, rank, xa, ya in zip(
    interesting["word"],
    interesting["frequency"],
    interesting["rank"],
    x_adjust,
    y_adjust,
):
    plt.annotate(
        word,
        xy=(np.log(rank), np.log(freq) + ya / 20),
        xytext=(np.log(rank) + xa, np.log(freq) + ya),
        size=9,
        arrowprops={"arrowstyle": "-", "color": "k"},
    )

plt.title("Haiku Word Frequency")
plt.xlabel(r"$\log(rank)$")
plt.ylabel(r"$\log(freq)$")
plt.ylim((-0.5, 11.9))
# plt.savefig('zipfs-uncleaned.svg')
plt.show()

## Zipf's Law After Removing Stop Words

We remove the stop words from the bag of words.

In [None]:
for stopword in haikulib.data_utils.STOPWORDS:
    if stopword in bag:
        del bag[stopword]

freq_table = get_freq_table(bag)
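
The effect of this step is easiest to see on a small example. The sketch below uses a made-up bag and a made-up stop word set; both are assumptions, since the contents of `haikulib.data_utils.STOPWORDS` are not shown here. Note that `Counter` does not raise when deleting a missing key, so the membership check above is a safety net rather than a requirement.

```python
from collections import Counter

toy_bag = Counter({"the": 40, "a": 25, "moon": 9, "rain": 7, "of": 20})
toy_stopwords = {"the", "a", "of", "and"}  # hypothetical stop word set

total_before = sum(toy_bag.values())
for stopword in toy_stopwords:
    # Counter.__delitem__ silently ignores keys that are not present ("and").
    del toy_bag[stopword]

removed_share = 1 - sum(toy_bag.values()) / total_before
print(toy_bag.most_common(), f"removed {removed_share:.0%} of tokens")
```

In typical English text a small set of stop words accounts for a large share of the tokens, so removing them tends to change the head of the rank-frequency curve much more than the tail.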

In [None]:
plt.plot(
    np.log(freq_table["rank"]), np.log(freq_table["frequency"]), ".", markersize=3
)

plt.title("Haiku Word Frequency")
plt.xlabel(r"$\log(rank)$")
plt.ylabel(r"$\log(freq)$")
plt.show()

In [None]:
freq_table["word"].head(15)

In [None]:
indices = get_indices(freq_table, "word", ["moon", "rain", "day", "night", "snow", "winter", "summer", "spring", "autumn"])

interesting = freq_table.loc[indices]
interesting

In [None]:
plt.plot(
    np.log(freq_table["rank"]), np.log(freq_table["frequency"]), ".", markersize=3
)

# This should also be a crime.
x_adjust = np.array([-0.35, -0.9, -0.23, -0.9, -0.1, -0.7, 0.3, -0.7, 0.4])
y_adjust = np.array([1.0, -1.0, 1.1, -1.1, 1.1, -1.4, 1.0, -1.45, 1.0])

for word, freq, rank, xa, ya in zip(
    interesting["word"],
    interesting["frequency"],
    interesting["rank"],
    x_adjust,
    y_adjust,
):
    plt.annotate(
        word,
        xy=(np.log(rank), np.log(freq) + ya / 20),
        xytext=(np.log(rank) + xa, np.log(freq) + ya),
        size=8,
        arrowprops={"arrowstyle": "-", "color": "k"},
    )

plt.title("Haiku Word Frequency")
plt.xlabel(r"$\log(rank)$")
plt.ylabel(r"$\log(freq)$")
plt.xlim((-0.5, 10.5))
plt.ylim((-0.5, 9))
# plt.savefig("zipfs-cleaned.svg")
plt.show()

## Word Frequencies After Stemming/Lemmatization

There are two approaches for getting the root form of a word. The first is stemming.

Stemming applies a sequence of rules that strip suffixes from a word to reduce it to its stem, which notably might not be a word. For example, "leaves" might be stemmed to "leav". Further, because stemming operates by removing parts of the word, it cannot map "better" and "good" to the same root.

Lemmatization, on the other hand, is aware of the vocabulary. It is a more sophisticated technique that returns the word to its base dictionary form via morphological analysis.
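
The difference can be illustrated with a deliberately tiny sketch. Neither function below is the algorithm used by NLTK or spaCy: `toy_stem` strips a few hard-coded suffixes and `toy_lemmatize` looks words up in a hand-written dictionary, both invented here for illustration.

```python
TOY_SUFFIXES = ("ing", "es", "s")  # hypothetical rule list, checked in order
TOY_LEMMAS = {"leaves": "leaf", "better": "good"}  # hypothetical vocabulary


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word itself."""
    return TOY_LEMMAS.get(word, word)


for word in ("leaves", "better", "good"):
    print(word, toy_stem(word), toy_lemmatize(word))
```

The rule-based stemmer produces the non-word "leav" and leaves "better" untouched, while the dictionary lookup maps "better" to "good".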

In [None]:
bag = haikulib.data_utils.get_bag_of(column='haiku', kind='words')

for stopword in haikulib.data_utils.STOPWORDS:
    if stopword in bag:
        del bag[stopword]

freq_table = get_freq_table(bag)

# Build a new bag of stems
porter_stems = Counter()
lancaster_stems = Counter()
snowball_stems = Counter()

### Stemming

In [None]:
porter_stemmer = PorterStemmer()
lancaster_stemmer = LancasterStemmer()
snowball_stemmer = SnowballStemmer("english")

In [None]:
for word, frequency in zip(freq_table["word"], freq_table["frequency"]):
    # A Counter treats missing keys as zero, so no membership check is needed.
    porter_stems[porter_stemmer.stem(word)] += frequency
    lancaster_stems[lancaster_stemmer.stem(word)] += frequency
    snowball_stems[snowball_stemmer.stem(word)] += frequency

Each of the stemmers produces similar results.

In [None]:
print("Original: length:", len(bag), "common words:", bag.most_common(15))
print(
    "Porter: length:",
    len(porter_stems),
    "common stems:",
    porter_stems.most_common(15),
)
print(
    "Lancaster: length:",
    len(lancaster_stems),
    "common stems:",
    lancaster_stems.most_common(15),
)
print(
    "Snowball: length:",
    len(snowball_stems),
    "common stems:",
    snowball_stems.most_common(15),
)

The Lancaster stemmer compresses the vocabulary the most, so we use its stems to plot the same frequency curve as before.
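
Compression here means how much the vocabulary shrinks when words are merged into stems. One way to measure it is the ratio of distinct stems to distinct words. The sketch below uses toy counters and a hypothetical `compression_ratio` helper rather than the real bags, and the stem assignments are made up rather than produced by a real stemmer.

```python
from collections import Counter


def compression_ratio(original, reduced):
    """Fraction of distinct keys remaining after merging (lower = more compression)."""
    return len(reduced) / len(original)


toy_words = Counter({"leaf": 3, "leaves": 5, "falling": 2, "falls": 4, "moon": 9})

# Hypothetical stem assignments standing in for a real stemmer's output.
toy_stem_of = {
    "leaf": "leaf",
    "leaves": "leav",
    "falling": "fall",
    "falls": "fall",
    "moon": "moon",
}

toy_stems = Counter()
for word, frequency in toy_words.items():
    toy_stems[toy_stem_of[word]] += frequency

ratio = compression_ratio(toy_words, toy_stems)
print(f"{len(toy_words)} words -> {len(toy_stems)} stems, ratio {ratio:.2f}")
```

Merging preserves the total token count, so only the number of distinct keys changes.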

In [None]:
freq_table = get_freq_table(lancaster_stems)

In [None]:
plt.plot(
    np.log(freq_table["rank"]), np.log(freq_table["frequency"]), ".", markersize=3
)

plt.title("Haiku Stem Frequency")
plt.xlabel(r"$\log(rank)$")
plt.ylabel(r"$\log(freq)$")
plt.show()

The shape of the curve has not changed much from the frequency plot with the stop words removed, except that it is slightly more curved. Perhaps there just aren't that many variants of each word. More likely, Zipf's law holds for natural language word stems as well as for the words themselves.

### Lemmatization

In [None]:
freq_table = get_freq_table(bag)

wn_lemmas = Counter()
wn_pos_lemmas = Counter()
spacy_lemmas = Counter()

In [None]:
lem = WordNetLemmatizer()
for word, frequency in zip(freq_table["word"], freq_table["frequency"]):
    # A Counter treats missing keys as zero, so no membership check is needed.
    wn_lemmas[lem.lemmatize(word)] += frequency

In [None]:
def get_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tags = {
        "J": wordnet.ADJ,
        "N": wordnet.NOUN,
        "V": wordnet.VERB,
        "R": wordnet.ADV,
    }
    # Default to a noun if the POS is unknown.
    return tags.get(tag, wordnet.NOUN)

In [None]:
for word, frequency in zip(freq_table["word"], freq_table["frequency"]):
    # A Counter treats missing keys as zero, so no membership check is needed.
    wn_pos_lemmas[lem.lemmatize(word, get_pos(word))] += frequency

In [None]:
# horrendously slow
for word, frequency in zip(freq_table["word"], freq_table["frequency"]):
    # This is not what SpaCy was meant for.
    doc = nlp(word)
    token = doc[0]
    lemma = token.lemma_

    # A Counter treats missing keys as zero, so no membership check is needed.
    spacy_lemmas[lemma] += frequency

In [None]:
print("original: length:", len(bag), "most common:", bag.most_common(15))
print(
    "WordNet: length:",
    len(wn_lemmas),
    "most common:",
    wn_lemmas.most_common(15),
)
print(
    "WordNet with POS: length:",
    len(wn_pos_lemmas),
    "most common:",
    wn_pos_lemmas.most_common(15),
)
print(
    "spaCy: length:",
    len(spacy_lemmas),
    "most common:",
    spacy_lemmas.most_common(15),
)

Note that all of the lemmatizers identify the same most common lemmas, but with different frequencies. The spaCy lemmatizer compresses the vocabulary the most, so we plot the same frequency curve as before using the spaCy lemmas.

In [None]:
freq_table = get_freq_table(spacy_lemmas)

In [None]:
plt.plot(
    np.log(freq_table["rank"]),
    np.log(freq_table["frequency"]),
    ".",
    markersize=3,
)

plt.title("Haiku Lemma Frequency")
plt.xlabel(r"$\log(rank)$")
plt.ylabel(r"$\log(freq)$")
plt.show()

The pattern is the same as before.

## Conclusion

My conclusion is that Zipf's law does in fact hold for haikus. My initial thought was that it might not, because haikus are a compressed form of communication. Interestingly, it holds both before and after removing stop words such as "an" and "the", which are quite common.

Zipf's law is stated for tokens in a natural language, but it also holds for the stems and lemmas of those tokens, which is not surprising.