The goal of this notebook is to examine word frequency in the haiku dataset in a more qualitative and subjective manner.

The intent is to build a word cloud not only for all of the words in the corpus, but also for the following categories:

* flowers
* colors
* animals

In [None]:
# Automagically reimport haikulib if it changes.
%load_ext autoreload
%autoreload 2
%aimport haikulib.utils.data
%aimport haikulib.utils.nlp

from collections import Counter
from IPython.display import Image

from wordcloud import WordCloud

data_dir = haikulib.utils.data.get_data_dir() / "experiments" / "eda" / "word_clouds"
data_dir.mkdir(parents=True, exist_ok=True)

# Word Cloud with Stop Words

If we build the word cloud without removing stop words, the results are less illuminating.

In [None]:
bag = haikulib.utils.data.get_bag_of(column="haiku", kind="words")
wordcloud = WordCloud(
    max_words=500, width=1600, height=900, background_color="white"
).generate_from_frequencies(bag)

wordcloud.to_file(data_dir / "all-words.png")
# Render generated image at full resolution in a manner that doesn't cache the images.
Image(data_dir / "all-words.png")

# Word Cloud without Stop Words

However, once all of the stop words are removed, we begin to see more interesting results.

As it was put to me, the results are quite stereotypical, but then stereotypes exist for a reason, and in this particular case they seem to be supported by evidence.
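
The effect of stop-word removal on a word-frequency bag can be sketched with the standard library alone. This is only an illustration: the `bag_of_words` helper, the tiny `STOP_WORDS` set, and the sample lines are all made up for this sketch, and it is not how `haikulib` builds its `nostopwords` column.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "to"}


def bag_of_words(lines, stop_words=frozenset()):
    """Count lowercased, whitespace-separated tokens, skipping any stop words."""
    bag = Counter()
    for line in lines:
        bag.update(token for token in line.lower().split() if token not in stop_words)
    return bag


# Made-up sample lines, not taken from the haiku dataset.
sample = [
    "the old pond",
    "a frog jumps in",
    "the sound of the water",
]

with_stop_words = bag_of_words(sample)
without_stop_words = bag_of_words(sample, STOP_WORDS)

print(with_stop_words.most_common(1))  # the most frequent token is a stop word
print(sorted(without_stop_words))      # only content words remain
```

Either `Counter` could be passed to `generate_from_frequencies`, which is why filtering the bag changes what dominates the cloud.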

In [None]:
bag = haikulib.utils.data.get_bag_of(column="nostopwords", kind="words")
wordcloud = WordCloud(
    max_words=500, width=1600, height=900, background_color="white"
).generate_from_frequencies(bag)

wordcloud.to_file(data_dir / "without-stopwords.png")
Image(data_dir / "without-stopwords.png")

# Parsing the Haiku Corpus for Specific n-Grams

In order to build correct (for some definition of correct) word clouds of flower, color, and animal occurrences, it's necessary to parse the haiku corpus and find occurrences of multi-word tokens.

The bag-of-words representation of the dataset discards word order, so it cannot be used to find multi-word n-grams.
Instead, we build a different representation of the haiku corpus, and then count the n-grams of sizes 1, 2, and 3 that occur in the sets of color, flora, and fauna names.
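
The counting step can be sketched roughly as follows. This is a minimal sketch under assumptions: `ngrams` and `count_matching_ngrams` are hypothetical helpers written for illustration, not the actual `haikulib` implementation, and the vocabulary and haiku are made up.

```python
from collections import Counter


def ngrams(tokens, n):
    """Yield every run of n consecutive tokens, joined by single spaces."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i : i + n])


def count_matching_ngrams(text, vocabulary, sizes=(1, 2, 3)):
    """Count the n-grams of the given sizes in text that appear in vocabulary."""
    tokens = text.lower().split()
    counts = Counter()
    for n in sizes:
        counts.update(gram for gram in ngrams(tokens, n) if gram in vocabulary)
    return counts


# Hypothetical vocabulary and haiku, for illustration only.
flower_names = {"cherry blossom", "rose", "morning glory"}
haiku = "morning glory / a rose beside the cherry blossom #"

# Drop the `/` line separators and `#` markers before tokenizing.
cleaned = " ".join(line.strip(" #") for line in haiku.split("/"))
counts = count_matching_ngrams(cleaned, flower_names)
print(counts)
```

Note that this sketch counts overlapping matches independently: if both "cherry" and "cherry blossom" were in the vocabulary, a single "cherry blossom" would increment both counts.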

In [None]:
# Form a list of haiku without the `/` and `#` symbols.
df = haikulib.utils.data.get_df()
corpus = []

for haiku in df["haiku"]:
    corpus.append(" ".join(line.strip(" #") for line in haiku.split("/")))

color_names = haikulib.utils.data.get_colors()
flower_names = haikulib.utils.data.get_flowers()
animal_names = haikulib.utils.data.get_animals()

In [None]:
%%time
colors = Counter()
flowers = Counter()
animals = Counter()

for haiku in corpus:
    # Update the counts for this haiku.
    colors.update(
        haikulib.utils.count_tokens_from(haiku, color_names, ngrams=[1, 2, 3])
    )
    flowers.update(
        haikulib.utils.count_tokens_from(haiku, flower_names, ngrams=[1, 2, 3])
    )
    animals.update(
        haikulib.utils.count_tokens_from(haiku, animal_names, ngrams=[1, 2, 3])
    )

# Flora Word Cloud

Many kinds of flora are mentioned in the haiku, so I thought it would be entertaining to look at a word cloud of the flowers and trees mentioned in the corpus.

In [None]:
wordcloud = WordCloud(
    max_words=500, width=1600, height=900, background_color="white"
).generate_from_frequencies(flowers)

wordcloud.to_file(data_dir / "flora.png")
Image(data_dir / "flora.png")

# Color Word Cloud

One of the most interesting and unexpected applications of programming that I have found was a PyCon 2017 conference talk titled [Gothic Colors: Using Python to understand color in nineteenth century literature](https://www.youtube.com/watch?v=3dDtACSYVx0).
This was the first application of programming to a soft science that I recall having been exposed to, and it's made a lasting impression.

Ever since watching the talk, I've wanted to apply scientific techniques to solve non-scientific problems.

I still intend to produce a color palette for haiku, but in the meantime, a word cloud of color names will do.
The color names and their RGB values have been taken from [https://xkcd.com/color/rgb/](https://xkcd.com/color/rgb/).
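
One way to draw each color name in its own color is to give the word cloud a coloring callback instead of editing its layout afterwards. The sketch below is an assumption-laden illustration: `make_color_func` is a hypothetical helper, and the palette is assumed to map names to hex strings, which may not match what `haikulib.utils.data.get_colors()` returns.

```python
def make_color_func(palette, substitutions=None):
    """Build a callable that maps each word to its color in palette.

    The returned callable takes the word plus any keyword arguments
    (font_size, position, orientation, random_state, ...) and ignores
    everything except the word itself.
    """
    substitutions = substitutions or {}

    def color_func(word, **kwargs):
        # Swap out colors that would be invisible on the chosen background.
        name = substitutions.get(word, word)
        return palette[name]

    return color_func


# Illustrative palette; the name -> hex-string format is an assumption.
palette = {"white": "#ffffff", "ice": "#d6fffa", "red": "#e50000"}
color_func = make_color_func(palette, substitutions={"white": "ice"})

print(color_func("red"))
print(color_func("white", font_size=12))
```

A callable of this shape can be passed as the `color_func` argument of `WordCloud`, or applied to an existing cloud with `wordcloud.recolor(color_func=color_func)`.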

In [None]:
wordcloud = WordCloud(
    max_words=100, width=1600, height=900, background_color="white"
).generate_from_frequencies(colors)

# Set the colors to the actual RGB color values experimentally determined to be associated with that color name.
# See: https://xkcd.com/color/rgb/
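# Each layout entry is ((word, frequency), font_size, position, orientation, color).
for i, layout in enumerate(wordcloud.layout_):
    (name, freq), font_size, position, orientation, _ = layout
    # White on a white background doesn't look so hot, so draw "white" in "ice" instead.
    rgb = color_names[name] if name != "white" else color_names["ice"]
    wordcloud.layout_[i] = ((name, freq), font_size, position, orientation, rgb)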

wordcloud.to_file(data_dir / "colors.png")
Image(data_dir / "colors.png")

# Fauna Word Cloud

A host of flora and fauna are mentioned in the haiku dataset, so I want to produce a word cloud for animals mentioned in haiku as well.

In [None]:
wordcloud = WordCloud(
    max_words=500, width=1600, height=900, background_color="white"
).generate_from_frequencies(animals)

wordcloud.to_file(data_dir / "fauna.png")
Image(data_dir / "fauna.png")