# Using Bokeh to analyze the paintings of Bob Ross

This notebook uses data scraped from https://www.twoinchbrush.com/, a Bob Ross fan site. I'm taking inspiration (or wholesale borrowing) from two other articles that use similar data:

- Walt Hickey's [A Statistical Analysis of the Work of Bob Ross](https://fivethirtyeight.com/features/a-statistical-analysis-of-the-work-of-bob-ross/) written for FiveThirtyEight.com ([github repo](https://github.com/fivethirtyeight/data/tree/master/bob-ross))
- Connor Rothschild's [Bob Ross Virtual Art Gallery](https://connorrothschild.github.io/bob-ross-art-gallery/) ([github repo](https://github.com/connorrothschild/bob-ross-art-gallery))

Both of these are great examples of data analysis, and the FiveThirtyEight article includes both code and replication data on the site's GitHub. Worth checking out if you're interested in seeing more.

<img src="https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcRoHDdcekKSGl-5gzbOJNeVbtgpqdwhljlrkYDIw9I58UA2r81dnE_Pof4_E5IQhzLpM5PMKsKP5OIR4aAZwz8zpg" alt="drawing" style="width:200px;display:block;margin-left: auto;margin-right: auto;"/> Groovy

In [None]:
import pandas as pd

from bokeh.io import output_notebook
from bokeh.plotting import figure, show,  output_file, save
from bokeh.models import  ColorBar, LinearColorMapper, CrosshairTool, Span, BasicTicker
from bokeh.transform import transform
import bokeh.palettes

# Configure Bokeh to show plots inline in the notebook


Important! This line will configure Bokeh to allow plots to show up inside your Jupyter notebook instead of opening in a new browser window:

In [None]:
output_notebook()

# Importing the data and making some plots

Start by reading in the Bob Ross data. These results are stored in a JSON file instead of a .csv because some columns hold nested data (lists of colors and tags) for each painting. The data was collected using the code in the `bobross_scraper.py` file (we also talked about it in week 9).


In [None]:
paintings_df = pd.read_json('bobross_data.json')

paintings_df.head()

The episodes are labeled according to season and episode number, but I want to be able to sort them more easily later, so I'm going to create numeric episode and season indicators. To do this, I'll use a regular expression to extract just the numeric part of the episode name column:

In [None]:
paintings_df['episode'][:5]

In [None]:
paintings_df['season_number'] = pd.to_numeric(paintings_df['episode'].str.extract(r'S([0-9]+)', expand=False))
paintings_df['episode_number'] = pd.to_numeric(paintings_df['episode'].str.extract(r'E([0-9]+)', expand=False))
# you could also do this one with one line like: 
# epnumbers = paintings_df['episode'].str.extract(r'S(?P<season_number>[0-9]+)E(?P<episode_number>[0-9]+)').apply(pd.to_numeric)
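
To see what the regular expressions are doing, here is a minimal sketch on a few made-up labels. I'm assuming the episode labels look like `S01E01`; the values below are hypothetical and not taken from the scraped data.

```python
import pandas as pd

# Hypothetical episode labels; the real column may be formatted differently.
episodes = pd.Series(['S01E01', 'S01E13', 'S31E02'])

# The parentheses are a capture group: only the digits after "S" (or "E") are kept.
season = pd.to_numeric(episodes.str.extract(r'S([0-9]+)', expand=False))
episode = pd.to_numeric(episodes.str.extract(r'E([0-9]+)', expand=False))

print(season.tolist())   # season numbers as integers
print(episode.tolist())  # episode numbers as integers
```

Converting with `pd.to_numeric` matters because the extracted values are strings, and strings like `'10'` and `'9'` would not sort in numeric order.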

Then I'll sort the data frame by season and episode and reset the index so I have a data frame that is ordered from the first episode to the last:

In [None]:
paintings_df = paintings_df.sort_values(['season_number', 'episode_number']).reset_index(drop=True).reset_index(names='episode_sequence')

Next, I'm going to make a dictionary object that matches each `color_names` value to its respective `hexcolors` value. Hexcodes are a way of identifying a color across platforms, so having the hexcodes here will allow us to use plotting palettes that match the colors used in each painting.

In [None]:

colors = paintings_df.explode('color_names')['color_names']
hexcodes = paintings_df.explode('hexcolors')['hexcolors']
color_dict = dict(zip(colors, hexcodes))
color_dict
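
Here is the same idea on a toy data frame, as a rough sketch. The painting rows and hex values below are made up for illustration, and the approach assumes the two nested lists are stored in the same order for every painting.

```python
import pandas as pd

# Toy data: each row holds parallel lists of color names and hex codes (values are illustrative).
toy = pd.DataFrame({
    'color_names': [['Sap Green', 'Titanium White'], ['Titanium White', 'Bright Red']],
    'hexcolors': [['#0A3410', '#FFFFFF'], ['#FFFFFF', '#DB0000']],
})

# Exploding each column separately gives two long Series in matching order.
names = toy.explode('color_names')['color_names']
codes = toy.explode('hexcolors')['hexcolors']

# zip pairs them up; repeated names simply overwrite the same key.
toy_color_dict = dict(zip(names, codes))
print(toy_color_dict)
```

Because dictionary keys are unique, a color that shows up in many paintings ends up with a single entry.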

Finally, I'm going to "explode" two of the nested columns to get a data frame with one row for each color used per episode. 

In [None]:
pl = paintings_df.explode(['hexcolors', 'color_names'])

pl.head(n=10) # notice how episode 1 now has 8 rows - one for each color used in that painting
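
If the multi-column `explode` is unfamiliar, this small sketch shows the reshaping on two made-up paintings. Passing a list of columns needs pandas 1.3 or newer, and the lists in each row have to be the same length.

```python
import pandas as pd

# Two hypothetical paintings with parallel lists of colors and hex codes.
toy = pd.DataFrame({
    'title': ['Painting A', 'Painting B'],
    'color_names': [['Sap Green', 'Titanium White', 'Bright Red'], ['Titanium White']],
    'hexcolors': [['#0A3410', '#FFFFFF', '#DB0000'], ['#FFFFFF']],
})

# Explode both list columns together: one output row per (painting, color) pair.
long_toy = toy.explode(['hexcolors', 'color_names'])
print(long_toy)
```

The non-list columns (here `title`) are repeated on every new row, and the original index value is repeated too, which is why the first painting's index shows up three times.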

I'll start by re-creating a couple of the plots from the Bob Ross virtual art gallery. Those graphics were primarily made in D3, which can do a lot of stuff but requires a pretty good working understanding of JavaScript to use.

# Colors by episode

The plot below shows the distribution of colors across each episode over the entire run of the show. You can hover over any of the rectangles to read some basic information about each episode/color. Most of the syntax here has a close resemblance to matplotlib, but the `tooltips` option is new: that part controls what shows up when you hover your mouse over part of the graph. So, this graph will display the episode, title, and the name of each color when you hover over a rectangle inside the plot.

In [None]:
# getting the last episode of each season so I can have a line on the x-axis indicating the seasons of the show
seasons = paintings_df.groupby('season_number')['episode_sequence'].max().reset_index()


p = figure(title="Bob Ross colors by episode",
           x_range=(min(pl.episode_sequence), max(pl.episode_sequence)), # x range goes from first to last episodes
           y_range=list(pl.color_names.value_counts().index),            # y axis will have each color
           width=1920, height=700,            
           toolbar_location='above',
           tooltips=[('episode', '@episode'), ('episode title', '@title'), ('color', '@color_names')]) # tooltips will display when you hover over each rectangle

p.xaxis.ticker = list(seasons['episode_sequence'])                                                     # changing out x-axis ticks to show seasons
p.xaxis.major_label_overrides = dict(zip(seasons['episode_sequence'], seasons['season_number'].astype('str') )) 
p.xgrid.grid_line_color = 'lightgrey'    
p.ygrid.visible=False
p.axis.axis_line_color = None
p.axis.major_label_text_font_size = "16px"
p.background_fill_color = "beige"                                                                   # beige background color to make the whites visible
r = p.rect(y="color_names", x="episode_sequence", width=1, height=1, source=pl,                     # this part adds the rectangles for each episode
           fill_color='hexcolors',
           hover_line_color="black",
           line_color=None)

show(p)
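
To isolate the `tooltips` piece, here is a stripped-down sketch with three made-up rectangles. The column names mirror the ones in `pl`, but the data and the `p_toy` name are hypothetical. Each tooltip entry is a `(label, '@column')` pair, where `@column` is looked up in the data source for whichever rectangle is under the mouse.

```python
import pandas as pd
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

# Toy long-format data: one row per (episode, color) pair. Values are illustrative.
toy = pd.DataFrame({
    'episode_sequence': [0, 0, 1],
    'title': ['Painting A', 'Painting A', 'Painting B'],
    'color_names': ['Sap Green', 'Titanium White', 'Titanium White'],
    'hexcolors': ['#0A3410', '#FFFFFF', '#FFFFFF'],
})
source = ColumnDataSource(toy)

# tooltips is a list of (label, value) pairs; "@name" refers to a column in the source.
p_toy = figure(x_range=(-0.5, 1.5), y_range=['Sap Green', 'Titanium White'],
               width=400, height=200,
               tooltips=[('title', '@title'), ('color', '@color_names')])

# One rectangle per row, filled with the hex code stored in the 'hexcolors' column.
p_toy.rect(x='episode_sequence', y='color_names', width=1, height=1,
           source=source, fill_color='hexcolors', line_color=None)
```

Calling `show(p_toy)` in the notebook would render it. Passing `tooltips` to `figure` adds a hover tool to the toolbar automatically.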

# Color bar plots

I can also make a bar plot for the overall frequency of each color. (This doesn't have much interactivity, so Bokeh is kind of pointless, but it's nice to have a consistent aesthetic.)

In [None]:
colors_dict = dict(zip(pl.color_names, pl.hexcolors))
pl_counts = pl.value_counts('color_names').reset_index(name='count')
pl_counts['hexcodes'] = [colors_dict.get(i) for i in pl_counts['color_names']]
p = figure(x_range = pl_counts['color_names'], height=450, width=1600, title ='Frequency of each color', toolbar_location = "above")
p.vbar(x='color_names', top='count', source=pl_counts, width=0.9, color='hexcodes')
p.background_fill_color = "beige"

show(p)

# Paintings by subject

Now, I want to know a little more about subjects of each painting. There's a nested column here called "tags" that contains information on what is included in each painting: 

In [None]:
paintings_df['tags'][:5]

I'm going to need to do some data manipulation here to make this work, but the end result is a data frame that has a correlation for every tag compared to every other tag. 
To start, I'll use `explode` to make my nested list of tags into a long list, then I'll use `pd.crosstab` to create a big matrix of 1s and 0s for each episode. If a painting includes a waterfall, for instance, that column will have a 1, and if it doesn't contain one, it will have a 0.

In [None]:
pdf = paintings_df.explode('tags')
tag_counts = pd.crosstab(index = pdf.episode_sequence, columns = pdf['tags'])
tag_counts = tag_counts.loc[:, tag_counts.sum(axis=0)>10]
tag_counts.head()
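
Here is what the explode-then-crosstab step produces on three made-up paintings. The tags and episode numbers are hypothetical, and I've left out the "more than 10 occurrences" filter since the toy data is so small.

```python
import pandas as pd

# Three hypothetical paintings with nested tag lists.
toy = pd.DataFrame({
    'episode_sequence': [0, 1, 2],
    'tags': [['cabin', 'winter'], ['waterfall'], ['cabin', 'waterfall']],
})

# One row per (painting, tag), then count tags per painting.
toy_long = toy.explode('tags')
toy_tags = pd.crosstab(index=toy_long['episode_sequence'], columns=toy_long['tags'])
print(toy_tags)
```

`pd.crosstab` counts occurrences, so the cells are 0 or 1 only as long as a tag is not listed twice for the same painting.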

Now I'll take this matrix and calculate the correlations between every tag across every episode. In essence, this correlation matrix reflects how often different things occur together in a Bob Ross painting. 

In [None]:
corr_matrix = tag_counts.corr()
corr_matrix.head()
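
To build some intuition for what the correlations mean with 0/1 columns, here is a small sketch with made-up tags. Two tags that always appear together get a correlation of 1, and two tags that never overlap (in this balanced toy example) get -1.

```python
import pandas as pd

# Hypothetical indicator matrix: rows are paintings, columns are tags.
toy_tags = pd.DataFrame({
    'cabin':  [1, 1, 0, 0],
    'winter': [1, 1, 0, 0],   # always appears with 'cabin'
    'beach':  [0, 0, 1, 1],   # never appears with 'cabin'
})

# DataFrame.corr() computes pairwise Pearson correlations between the columns.
toy_corr = toy_tags.corr()
print(toy_corr)
```

Real tags will land somewhere in between, and rare tags can produce noisy values, which is one reason for dropping tags with very few occurrences first.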

To make our correlation matrix a little more visually appealing, I'll also use a function from [this post](https://wil.yegelwel.com/cluster-correlation-matrix/) by Wil Yegelwel that helps to sort big correlation matrices so they look less chaotic.

In [None]:
from cluster_correlation_matrix import cluster_corr
corr_matrix = cluster_corr(corr_matrix)
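
If you don't have the `cluster_correlation_matrix` module handy, the general idea can be sketched with SciPy's hierarchical clustering. To be clear, this is my own hypothetical stand-in (the `reorder_by_similarity` name and the toy tags are made up), not a copy of the `cluster_corr` function, and it may order things differently.

```python
import pandas as pd
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import pdist

def reorder_by_similarity(corr):
    """Reorder rows and columns so tags with similar correlation profiles sit together."""
    # Treat each row of the correlation matrix as a point and cluster the rows.
    link = linkage(pdist(corr.values), method='complete')
    order = leaves_list(link)  # left-to-right leaf order of the dendrogram
    return corr.iloc[order, order]

# Hypothetical indicator matrix with two loose groups of tags.
toy_tags = pd.DataFrame({
    'cabin':  [1, 0, 1, 0, 1, 0],
    'beach':  [0, 1, 0, 1, 0, 0],
    'winter': [1, 0, 1, 0, 0, 0],
    'palm':   [0, 1, 0, 1, 0, 1],
})
toy_sorted = reorder_by_similarity(toy_tags.corr())
print(toy_sorted.round(2))
```

The values in the matrix are unchanged. Only the row and column order moves, so blocks of related tags show up as patches of similar color in a heatmap.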


Finally, I'll use `stack` to convert this wide-format matrix back into a long-format data frame. In `corr_long`, a positive `value` means that `tags_x` and `tags_y` tend to occur together. A negative `value` means that `tags_x` is less likely when `tags_y` is present (and vice versa).


In [None]:
corr_long = corr_matrix.stack().rename_axis(['tags_x', 'tags_y']).rename('value').reset_index()
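
Here is the wide-to-long step on a tiny made-up correlation matrix, so you can see exactly what shape comes out. The numbers are illustrative only.

```python
import pandas as pd

# A 2x2 stand-in for the correlation matrix (values are made up).
toy_corr = pd.DataFrame(
    [[1.0, -0.5], [-0.5, 1.0]],
    index=['cabin', 'beach'], columns=['cabin', 'beach'],
)

# stack() moves the column labels into a second index level, giving one value per pair.
toy_long = (toy_corr.stack()
            .rename_axis(['tags_x', 'tags_y'])  # name the two index levels
            .rename('value')                    # name the values
            .reset_index())                     # turn the index levels into columns
print(toy_long)
```

An N-by-N matrix becomes N×N rows, one per ordered pair of tags, which is the format Bokeh's `rect` glyph wants for a heatmap.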

We'll use a heatmap to visualize this long list of correlations. The heatmap here shows the associations between different subjects, and it's color-coded so that blue indicates negative associations and red indicates positive associations. So, for instance, we can see that "winter" and "autumn" paintings are negatively associated (makes sense, because how is he going to paint winter and autumn at the same time?).

In [None]:


# You can use your own palette here
colors = ['#d7191c', '#fdae61', '#ffffbf', '#a6d96a', '#1a9641']

custom_tooltip = """
        <span style="font-size: 17px;"><strong>@tags_x, @tags_y:</strong> @value{0.00}</span>
        """

# Map each correlation value to a color on a diverging blue-to-red scale
mapper = LinearColorMapper(
    palette= bokeh.palettes.tol['BuRd'][6], low=-.75, high=.75)
# Define a figure
p = figure(
    width=1920,
    height=1080,
    title="Correlation of features in Bob Ross paintings",
    x_range=list(corr_long.tags_x.unique()),
    y_range=list(corr_long.tags_y.unique()),
   # toolbar_location=None,
   # tools="",
    tooltips = custom_tooltip,
    x_axis_location="above")


# Create rectangle for heatmap
p.rect(
    x="tags_x",
    y="tags_y",
    width=1,
    height=1,
    source=corr_long,
    line_color='lightgrey',
    fill_color=transform('value', mapper))
# Add legend
color_bar = ColorBar(
    color_mapper=mapper,
    location=(0, 0),
    ticker=BasicTicker(desired_num_ticks=len(colors)))

p.xaxis.major_label_orientation = .45
p.axis.major_label_text_font_size = "16px"


p.add_layout(color_bar, 'right')
width = Span(dimension="width", line_dash="dashed", line_width=2)
height = Span(dimension="height", line_dash="dotted", line_width=2)

p.add_tools(CrosshairTool(overlay=[width, height]))

show(p)

# Clustering and Principal Components


We can go a step further here by using some tools from the world of machine learning. Based on the associations I'm seeing, I think we could categorize the paintings into a small set of recurring themes. For instance, if you explore the correlation heatmap above, you might notice that things like "cabin" and "winter" and "mountains" are all positively correlated, so I suspect there are a lot of paintings here that are of things like "snow-covered cabins in the mountains"

To identify this smaller number of general themes, I can use K-means clustering. K-Means is a simple clustering algorithm that identifies "K" clusters from some input data. The number of clusters is determined by the researcher (this is often more of an art than a science) and we expect observations in the same cluster to have broadly similar values.

I also want to be able to visualize the results of my cluster analysis. Ideally, I'd like to visualize something like this in a scatter plot, but I have a large number of dimensions in my data set instead of just an "X" and a "Y". So I'm going to use Principal Components Analysis to create a low-dimensional representation of my data that captures most of the variation in a smaller number of variables.

In [None]:
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


I'll scale my data to prepare it for K-means clustering (scaling gives every variable a mean of 0 and a standard deviation of 1), and I'll use the `KMeans()` class from scikit-learn to perform the clustering. I'm going to set `n_clusters=7` to get 7 clusters, and I also want to be sure to set the `random_state` argument so that my results are replicable (the K-means algorithm is non-deterministic, so the only way to ensure we get the same results every time is to control how the random number generation gets initialized).



In [None]:
# scale the data for k-means clustering (each tag column gets mean 0 and standard deviation 1)
scaler = StandardScaler()
scaled_tags = scaler.fit_transform(tag_counts)
kmeans = KMeans(n_clusters = 7, random_state = 999, n_init='auto')
kmeans.fit(scaled_tags)
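
Here is a minimal sketch of scaling plus K-means on made-up points, mainly to show what `random_state` buys you. The toy array and variable names are hypothetical. Note that `StandardScaler.fit()` only learns the means and standard deviations; it is `transform()` (or `fit_transform()`) that returns the scaled values.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Two obvious groups of made-up points.
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)

# fit_transform learns each column's mean and standard deviation, then rescales the data.
X_scaled = StandardScaler().fit_transform(X)

# Same random_state, same starting centroids, same labels on every run.
labels_a = KMeans(n_clusters=2, random_state=999, n_init=10).fit(X_scaled).labels_
labels_b = KMeans(n_clusters=2, random_state=999, n_init=10).fit(X_scaled).labels_

print(labels_a)
print((labels_a == labels_b).all())
```

The label numbers themselves are arbitrary. What matters is which observations share a label.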



In [None]:
tag_counts.shape

Now that I have my clusters, I'll get the cluster labels and add them as a new column on to `paintings_df`

In [None]:
paintings_df['cluster'] = kmeans.labels_


Finally, I'll run the principal components analysis to reduce the number of dimensions from around 40 to just two, and I'll add these two principal components to my data frame as well.  

In [None]:
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(tag_counts)
pca_data = pd.DataFrame(data = principalComponents
             , columns = ['pc1', 'pc2'])
paintings_df = pd.concat([pca_data, paintings_df], axis=1)
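
One thing worth checking after a PCA is how much of the original variation the two components actually keep. This sketch does that on random made-up 0/1 data (the names and data are hypothetical), using the `explained_variance_ratio_` attribute of the fitted `PCA` object.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Random 0/1 data standing in for a tag matrix: 20 paintings, 5 tags.
rng = np.random.default_rng(0)
toy_tags = pd.DataFrame(rng.integers(0, 2, size=(20, 5)),
                        columns=['a', 'b', 'c', 'd', 'e'])

# Project the 5 columns down to 2 principal components.
toy_pca = PCA(n_components=2)
toy_pcs = pd.DataFrame(toy_pca.fit_transform(toy_tags), columns=['pc1', 'pc2'])

# Share of the total variance retained by the two components (between 0 and 1).
share = toy_pca.explained_variance_ratio_.sum()
print(toy_pcs.shape, round(share, 2))
```

With dozens of tags, two components will usually keep only part of the variation, so the scatter plot is best read as a rough map rather than an exact picture of the distances between paintings.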

And I'll add some color-coding and shapes into my data frame so that our plot has different colors and shapes for each cluster.

In [None]:
cluster_colors = ['Alizarin Crimson', 'Van Dyke Brown', 'Cadmium Yellow', 'Yellow Ochre', 'Phthalo Blue', 'Bright Red', 'Sap Green']
cluster_markers = ['hex', 'circle', 'triangle', 'diamond', 'plus','star', 'square','square_pin', 'triangle_pin' ]
paintings_df['cluster_color'] = [color_dict.get(cluster_colors[i]) for i in paintings_df.cluster]



In [None]:
paintings_df.head()

## Bob Ross cluster analysis
Now, we'll lay out a scatter plot to display the results of the cluster analysis, and we'll use the first two principal components to set the location of each point. We'll color-code each marker based on its cluster membership, and we'll add a customized tooltip that will display the image of the painting along with some additional information like the colors used, tags, and episode.

Clicking one of the legend markers will hide the points for that cluster. And hovering over a point will show additional data on each painting. See if you can identify some common themes associated with the different clusters.

In [None]:

# Customized HTML tooltip. The parts with an @colname will be filled in with data from my data frame.
TOOLTIPS = """
    <div style="width:400px;">
        <div>
            <img
                src="@image_url" height="25%"  width="75%"
                style="display: block; margin-left: auto; margin-right: auto;"
                border="2"
            ></img>
        </div>
        <div>
            <span style="font-size: 17px; font-weight: bold;">@title</span>
                <div>
                    <span>Season @season_number - Episode @episode_number</span>
                </div>
            <br>
        </div>
        
        <div>
            <span style="font-size: 12px; color: #966;"><strong>Colors:</strong> @color_names</span>
        </div>
        <div>
            <span style="font-size: 12px"><strong>Tags:</strong> @tags</span>
        </div>
        <div>
            <span style="font-size: 15px;">Location</span>
            <span style="font-size: 10px; color: #696;">($x, $y)</span>
        </div>
    </div>
"""

# Define the figure for the scatter plot
p = figure(title="Bob Ross Painting Clusters Analysis",
           tooltips=TOOLTIPS,
           x_range=(min(paintings_df.pc1)-1, max(paintings_df.pc1)+1),
           y_range=(min(paintings_df.pc2)-1, max(paintings_df.pc2)+1),            
           width=800, height=800,   
           x_axis_label="PC 1",
           y_axis_label="PC 2",
           toolbar_location='above') 
                                                        

# loop through each unique cluster in order. Doing this allows us to have an interactive legend on the plot
for i in paintings_df.cluster.sort_values().unique():
    data = paintings_df[paintings_df['cluster'] == i]
    p.scatter(x='pc1', y='pc2',  
             source=data,   
             legend_label = 'cluster: ' + str(i),
             fill_color = 'cluster_color',
             marker = cluster_markers[i],
             color = color_dict[cluster_colors[i]],
             line_color = 'black',
             alpha =.8,
             size=15)

p.background_fill_color = "beige"
p.legend.label_text_font_size = '20pt'

p.legend.click_policy="hide"
p.legend.location = "top_right"
show(p)