The goal of this notebook is to build a color palette of my haiku dataset in the same vein as a PyCon 2017 conference talk titled [Gothic Colors: Using Python to understand color in nineteenth century literature](https://www.youtube.com/watch?v=3dDtACSYVx0).

This conference talk was the first application of programming to a soft science that I recall being exposed to, and it's made a lasting impression.
Ever since watching the talk, I've wanted to apply scientific techniques to solve non-scientific, soft, and natural problems.

Here, I intend to parse the use of color from the haiku in an intelligent manner, one that is aware that the word "rose" has different meanings in the sentences:

* "I picked a rose."
* "Her shoes were rose colored."
* "He rose to greet me."

However, the first two uses both contribute to a "color palette" for haiku, so we only need to exclude the third case.

In order to perform this differentiation, the haiku corpus must be part-of-speech (POS) tagged.
That is, each word must be annotated with its part of speech.
This is a daunting task for such a large corpus -- as of the time of this notebook, the corpus contains over 178,000 words!

Fortunately, POS-tagging is not a new problem, and out-of-the-box methods exist for performing it.
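
To make the idea concrete, here is a minimal sketch of how tags separate the three uses of "rose". The tags are hand-written Penn Treebank tags chosen for illustration, not the output of a real tagger, and `COLOR_TAGS` and `rose_is_color` are hypothetical names that do not come from `haikulib`.

```python
# Hand-annotated (word, tag) pairs using Penn Treebank tags.
# These tags are assumptions for illustration, not real tagger output.
tagged_sentences = [
    [("I", "PRP"), ("picked", "VBD"), ("a", "DT"), ("rose", "NN")],
    [("Her", "PRP$"), ("shoes", "NNS"), ("were", "VBD"), ("rose", "JJ"), ("colored", "JJ")],
    [("He", "PRP"), ("rose", "VBD"), ("to", "TO"), ("greet", "VB"), ("me", "PRP")],
]

# Keep a word only when it is tagged as an adjective or a singular noun.
COLOR_TAGS = {"JJ", "NN"}


def rose_is_color(sentence):
    """Return True if 'rose' appears with an adjective or noun tag."""
    return any(word == "rose" and tag in COLOR_TAGS for word, tag in sentence)


results = [rose_is_color(sentence) for sentence in tagged_sentences]
print(results)  # [True, True, False]
```

Only the verb use in the third sentence is rejected, which is the behavior we want from the color parser.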

In [1]:
# Automagically reimport haikulib if it changes.
%load_ext autoreload
%autoreload 2

%config InlineBackend.figure_format = 'svg'
%matplotlib inline

import collections

import matplotlib.pyplot as plt
import nltk
import pandas as pd
import seaborn as sns
from IPython.display import Image

from haikulib import data, nlp, utils
from haikulib.eda import colors

data_dir = data.get_data_dir() / "experiments" / "eda" / "colors"
data_dir.mkdir(parents=True, exist_ok=True)
pd.set_option("display.latex.repr", True)
pd.set_option("display.latex.longtable", True)
plt.rcParams["figure.figsize"] = (16, 9)
sns.set()

# The Naive Approach

It's often useful to implement a simpler version of a feature before implementing the full functionality.
So before performing POS-tagging and more intelligent color identification, we simply look for any occurrence of a color name in the haiku corpus.

We do so by stripping the `/` and `#` meta-tokens from each haiku, then looking for any $n$-grams in the corpus that match our list of color names.
We use $n \in \{1, 2, 3\}$.

In [2]:
# Form list of haiku without '/' and '#' symbols
df = data.get_df()
corpus = []

for haiku in df["haiku"]:
    corpus.append(" ".join(line.strip(" #") for line in haiku.split("/")))

color_names = data.get_colors_dict()

In [3]:
%%time
naive_colors = collections.Counter()
for haiku in corpus:
    # Update the color counts for this haiku.
    naive_colors.update(
        nlp.count_tokens_from(haiku, color_names, ngrams=[1, 2, 3])
    )

CPU times: user 706 ms, sys: 3.74 ms, total: 710 ms
Wall time: 710 ms
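
The source of `nlp.count_tokens_from` is not shown in this notebook. The sketch below illustrates the general idea of naive $n$-gram matching with the standard library only; `ngrams`, `count_color_ngrams`, and `toy_colors` are hypothetical names, and the real helper may work differently.

```python
import collections


def ngrams(tokens, n):
    """Yield every run of n consecutive tokens, joined by spaces."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i : i + n])


def count_color_ngrams(text, color_names, sizes=(1, 2, 3)):
    """Count every n-gram of the given sizes that is a known color name."""
    tokens = text.split()
    counts = collections.Counter()
    for n in sizes:
        counts.update(g for g in ngrams(tokens, n) if g in color_names)
    return counts


# A tiny stand-in for the real color dictionary. Only membership matters
# here, so a set of names is enough.
toy_colors = {"blue", "dark blue", "sea"}
toy_counts = count_color_ngrams("dark blue lines in a sea", toy_colors)
print(toy_counts)
```

Note that this naive scheme counts both "blue" and "dark blue" for the same two words, which is one reason its totals run high.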


Here, we build a data frame of the color occurrences for ease of use in visualization.
Previously, it was sufficient to use the `collections.Counter` object directly to generate the word cloud, but now we prefer a more structured form.

In [4]:
# keys() and values() iterate in the same order. Dicts preserve insertion order
# as a CPython 3.6 implementation detail and as a language guarantee from Python 3.7.
naive_color_counts = pd.DataFrame(
    {
        "color": list(naive_colors.keys()),
        "count": list(naive_colors.values()),
        "html_color": [color_names[c] for c in naive_colors],
    }
)

total_color_count = naive_color_counts["count"].sum()

print(f"There are {total_color_count} occurrences of color in the corpus")
print(f"There are {len(naive_color_counts)} unique colors")

naive_color_counts.head(10)

There are 21154 occurrences of color in the corpus
There are 422 unique colors


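| | color | count | html_color |
|---|---|---|---|
| 0 | green | 415 | #15b01a |
| 1 | snow | 1981 | snow |
| 2 | dusk | 482 | #4e5481 |
| 3 | sea | 679 | #3c9992 |
| 4 | watermelon | 27 | #fd4659 |
| 5 | sky | 1232 | #82cafc |
| 6 | stone | 275 | #ada587 |
| 7 | sand | 300 | #e2ca76 |
| 8 | rust | 27 | #a83c09 |
| 9 | forest | 158 | #0b5509 |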


# Parsing Colors using Part-Of-Speech Tagging

Rather than implement the color parsing as a part of this notebook, it is performed as a part of the `haikulib.eda` library so that the color parsing can be done *on creation* of the `haiku.csv` cleaned data file.
This enables using the results of this analysis in other exploration.

However, it's useful to examine the implementation of the color parsing code to demonstrate how it works.
In order to do this in a manner that prevents copy-pasting implementations --- which inevitably leads to multiple out-of-sync versions of the same code --- I wrote a small introspective helper function to render the source code of the given function as syntax-highlighted HTML in a Jupyter notebook.

In [5]:
utils.display_source('haikulib.utils', 'display_source')
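
For readers without the rendered output, the core of such a helper can be sketched with the standard library alone. This is a minimal sketch under assumptions, not the `haikulib` implementation: `get_source` is a hypothetical name, and it returns plain text rather than highlighted HTML.

```python
import importlib
import inspect


def get_source(module_name, function_name):
    """Return the source text of a function, looked up by dotted module name."""
    module = importlib.import_module(module_name)
    function = getattr(module, function_name)
    return inspect.getsource(function)


# Any importable pure-Python function works; json.dumps is only an example.
source = get_source("json", "dumps")
print(source.splitlines()[0])
```

In a notebook, the returned text could then be passed through a highlighter such as Pygments and wrapped in `IPython.display.HTML`. Whether `display_source` does exactly that is an assumption.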

We can determine if a word is a color simply by checking if it is contained in our master list of colors, and by checking if it is an adjective or a noun.

In [6]:
utils.display_source('haikulib.eda.colors', 'is_color')

However, this relies on each word in the corpus being tagged with its corresponding part of speech.
This too is simple.

In [7]:
utils.display_source('haikulib.nlp', 'pos_tag')

Notice that the line separators and end-of-haiku symbols are ignored, as they do not have a part of speech.
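
The filtering step can be sketched as follows. This is an illustration under assumptions rather than the `haikulib.nlp.pos_tag` source: `pos_tag_sketch`, `META_TOKENS`, and `toy_tagger` are hypothetical names, and the stand-in tagger simply labels every word as a noun.

```python
META_TOKENS = {"/", "#"}


def pos_tag_sketch(haiku, tagger):
    """Tag the words of a haiku, skipping the '/' and '#' meta-tokens."""
    words = [token for token in haiku.split() if token not in META_TOKENS]
    return tagger(words)


def toy_tagger(words):
    """Stand-in tagger that labels every word as a noun; NOT a real tagger."""
    return [(word, "NN") for word in words]


tagged = pos_tag_sketch(
    "dark blue lines / in a light olive green sea salt / dreams #", toy_tagger
)
print(tagged)
```

With NLTK installed and its tagger data downloaded, `nltk.pos_tag` could be passed as `tagger`, since it accepts a list of tokens and returns `(token, tag)` pairs.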

Now we can simply find all of the colors in a given haiku as follows.

In [8]:
# Modified to test colors of all three sizes.
haiku = "dark blue lines / in a light olive green sea salt / dreams #"
haiku_colors = [
    tagged_word[0]
    for tagged_word in nlp.pos_tag(haiku)
    if colors.is_color(tagged_word)
]
print(haiku_colors)

['dark', 'blue', 'olive', 'green', 'sea']


But what about finding the color "dark blue"?
In order to find multi-word colors, we need to parse and test $n$-grams from the haiku.

In [9]:
utils.display_source('haikulib.eda.colors', 'find_colors')

Notice that we only use the `is_color()` function discussed above to determine whether single-token words are colors.
The requirement for an $n$-gram to be a color is relaxed to a simple containment check: is the $n$-gram in our list of known colors?

Further notice that there is soul-crushing logic used to parse the colors `["light olive green", "sea"]` from the string `"light olive green sea"` instead of the colors `["olive", "green", "sea", "olive green", "light olive green"]`.

In [10]:
colors.find_colors(nlp.pos_tag(haiku))

['dark blue', 'light olive green', 'sea']
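
One way to get this behavior is a greedy longest-match scan: at each position, try the longest $n$-gram first and skip the words it consumes. The sketch below assumes that strategy. It is not the `find_colors` source, and `KNOWN_COLORS`, `COLOR_TAGS`, `find_colors_sketch`, and the hand-written tags are all hypothetical. For simplicity, it also ignores line breaks within the haiku.

```python
# Hypothetical color list; the real one is much larger.
KNOWN_COLORS = {
    "dark", "blue", "dark blue",
    "olive", "green", "olive green", "light olive green",
    "sea",
}
COLOR_TAGS = {"JJ", "NN"}


def find_colors_sketch(tagged_words, max_n=3):
    """Greedily match the longest color n-gram starting at each position."""
    found = []
    i = 0
    while i < len(tagged_words):
        for n in range(max_n, 0, -1):
            window = tagged_words[i : i + n]
            if len(window) < n:
                continue  # not enough words left for this n-gram size
            phrase = " ".join(word for word, _ in window)
            if n == 1:
                # Single words must also carry an adjective or noun tag.
                matched = phrase in KNOWN_COLORS and window[0][1] in COLOR_TAGS
            else:
                # Multi-word colors only need to appear in the color list.
                matched = phrase in KNOWN_COLORS
            if matched:
                found.append(phrase)
                i += n  # skip the words consumed by this color
                break
        else:
            i += 1  # no color starts at this position
    return found


# Hand-written tags (assumptions), with '/' and '#' already removed.
tagged = [
    ("dark", "JJ"), ("blue", "JJ"), ("lines", "NNS"), ("in", "IN"), ("a", "DT"),
    ("light", "JJ"), ("olive", "JJ"), ("green", "JJ"), ("sea", "NN"),
    ("salt", "NN"), ("dreams", "NNS"),
]
found = find_colors_sketch(tagged)
print(found)  # ['dark blue', 'light olive green', 'sea']
```

Because the scan jumps past every matched phrase, "olive", "green", and "olive green" are never reported separately once "light olive green" has been found.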

Then we can parse colors from the haiku before saving the haiku in the `haiku.csv` data file.
This enables spatial exploration of the colors, because they are associated with individual haiku rather than building a simple `collections.Counter` object of colors as above with the naive approach.

In [11]:
utils.display_source('haikulib.data.initialization', 'init_csv')

In [12]:
df = data.get_df()
df.tail(6)

| | haiku | colors | lines |
|---|---|---|---|
| 55363 | the first dusk of may / it's suddenly there / ... | [dusk, pale orange] | 3 |
| 55364 | dandelions / two drinkers unzip / by the path # | [] | 3 |
| 55365 | scrunched up clouds / a blue plastic bag / in ... | [blue] | 3 |
| 55366 | early spring / one morning the clang of / scaf... | [] | 3 |
| 55367 | platform 9 / through the glass roof the shapes... | [glass] | 3 |
| 55368 | sneaking out / thru' the bathroom window / ste... | [] | 3 |


We can also produce a `DataFrame` containing the colors, their counts, and their HTML color codes as above.

In [13]:
pos_tagging_color_counts = colors.get_color_counts()

total_color_count = pos_tagging_color_counts["count"].sum()

print(f"There are {total_color_count} occurrences of color in the corpus")
print(f"There are {len(pos_tagging_color_counts)} unique colors")

pos_tagging_color_counts.head(10)

There are 18531 occurrences of color in the corpus
There are 400 unique colors


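| | color | count | html_color |
|---|---|---|---|
| 0 | green | 367 | #15b01a |
| 1 | snow | 1511 | snow |
| 2 | dusk | 357 | #4e5481 |
| 3 | sea | 635 | #3c9992 |
| 4 | watermelon | 27 | #fd4659 |
| 5 | sky | 1035 | #82cafc |
| 6 | stone | 269 | #ada587 |
| 7 | sand | 291 | #e2ca76 |
| 8 | rust | 22 | #a83c09 |
| 9 | forest | 51 | #0b5509 |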


Compare the POS-tagging results with those from the naive approach, summarized again below.
Notice that POS-tagging pruned *twenty-two* unique colors (422 down to 400) and *2,623* occurrences (21,154 down to 18,531): color words that were not tagged as adjectives or nouns, or that were duplicated by the occurrence of an $n$-gram.

In [14]:
total_color_count = naive_color_counts["count"].sum()

print(f"There are {total_color_count} occurrences of color in the corpus")
print(f"There are {len(naive_color_counts)} unique colors")

naive_color_counts.head(10)

There are 21154 occurrences of color in the corpus
There are 422 unique colors


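| | color | count | html_color |
|---|---|---|---|
| 0 | green | 415 | #15b01a |
| 1 | snow | 1981 | snow |
| 2 | dusk | 482 | #4e5481 |
| 3 | sea | 679 | #3c9992 |
| 4 | watermelon | 27 | #fd4659 |
| 5 | sky | 1232 | #82cafc |
| 6 | stone | 275 | #ada587 |
| 7 | sand | 300 | #e2ca76 |
| 8 | rust | 27 | #a83c09 |
| 9 | forest | 158 | #0b5509 |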
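
The size of the reduction can be checked directly from the two sets of printed totals. The variable names below are introduced only for this check.

```python
# Totals copied from the printed output of the two approaches.
naive_total, naive_unique = 21154, 422
tagged_total, tagged_unique = 18531, 400

pruned_occurrences = naive_total - tagged_total
pruned_unique = naive_unique - tagged_unique
pruned_share = pruned_occurrences / naive_total

print(pruned_unique)          # 22
print(pruned_occurrences)     # 2623
print(f"{pruned_share:.1%}")  # 12.4%
```

In other words, POS-tagging and $n$-gram deduplication together remove roughly one in eight of the naive color matches.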


# Color Palette Visualization

There are a number of palette visualization techniques we could use.
We will visualize the haiku color palette using:

* Word Cloud
* Histogram
* Pie Chart
* Ordered Grid
* Spectrum

## Word Cloud

## Histogram

## Pie Chart

## Chronological Grid

## Spectrum

**TODO: This may not be very enlightening because it's difficult to describe different shades of color.**

# Color Adjacency Graph

**TODO: Maybe move to a different notebook?**