# Using Python for Research Homework: Week 4, Case Study 1

This case study gives step-by-step instructions for creating plots in Bokeh, a library designed for simple, interactive plotting. We will demonstrate Bokeh by continuing the analysis of Scotch whiskies.

In [1]:
import numpy as np
import pandas as pd

whisky = pd.read_csv("https://courses.edx.org/asset-v1:HarvardX+PH526x+2T2019+type@asset+block@whiskies.csv", index_col=0)
correlations = pd.DataFrame.corr(whisky.iloc[:,2:14].transpose())
correlations = np.array(correlations)

### Exercise 1

In this exercise, we provide a basic demonstration of an interactive grid plot using Bokeh. Make sure to study this code now, as we will edit similar code in the exercises that follow.

#### Instructions
- Execute the following code and follow along with the comments. We will later adapt this code to plot the correlations among distillery flavor profiles as well as plot a geographical map of distilleries colored by region and flavor profile.
- Once the plot appears, hover, click, and drag your cursor on it to interact with it. Additionally, explore the icons in the top-right corner of the plot for more interactive options!

In [2]:
# First, we import a tool to allow text to pop up on a plot when the cursor
# hovers over it.  Also, we import a data structure used to store arguments
# of what to plot in Bokeh.  Finally, we will use numpy for this section as well!

from bokeh.models import HoverTool, ColumnDataSource

# Let's plot a simple 5x5 grid of squares, alternating in color as red and blue.

plot_values = [1,2,3,4,5]
plot_colors = ["red", "blue"]

# How do we tell Bokeh to plot each point in a grid?  Let's use a function that
# finds each combination of values from 1-5.
from itertools import product

grid = list(product(plot_values, plot_values))
print(grid)

[(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (5, 1), (5, 2), (5, 3), (5, 4), (5, 5)]


In [3]:
# The first value is the x coordinate, and the second value is the y coordinate.
# Let's store these in separate lists.

xs, ys = zip(*grid)
print(xs)
print(ys)

(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5)
(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5)
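
The unpacking step `zip(*grid)` can be easier to follow on a smaller grid. The following is a minimal sketch; the names `small_grid`, `small_xs`, and `small_ys` are illustrative and not part of the exercise.

```python
from itertools import product

# A smaller 2x2 grid keeps the output easy to read.
small_grid = list(product([1, 2], [1, 2]))

# zip(*small_grid) passes each (x, y) pair to zip as a separate argument.
# zip then groups all first elements together and all second elements together.
small_xs, small_ys = zip(*small_grid)

print(small_grid)  # [(1, 1), (1, 2), (2, 1), (2, 2)]
print(small_xs)    # (1, 1, 2, 2)
print(small_ys)    # (1, 2, 1, 2)
```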


In [4]:
# Now we will make a list of colors, alternating between red and blue.

colors = [plot_colors[i%2] for i in range(len(grid))]
print(colors)

['red', 'blue', 'red', 'blue', 'red', 'blue', 'red', 'blue', 'red', 'blue', 'red', 'blue', 'red', 'blue', 'red', 'blue', 'red', 'blue', 'red', 'blue', 'red', 'blue', 'red', 'blue', 'red']


In [5]:
# Finally, let's determine the strength of transparency (alpha) for each point,
# where 0 is completely transparent.

alphas = np.linspace(0, 1, len(grid))

# Bokeh likes each of these to be stored in a special dataframe, called
# ColumnDataSource.  Let's store our coordinates, colors, and alpha values.

source = ColumnDataSource(
    data = {
        "x": xs,
        "y": ys,
        "colors": colors,
        "alphas": alphas,
    }
)
# We are ready to make our interactive Bokeh plot!
from bokeh.plotting import figure, output_file, show

output_file("Basic_Example.html", title="Basic Example")
fig = figure(tools="hover")
fig.rect("x", "y", 0.9, 0.9, source=source, color="colors",alpha="alphas")
hover = fig.select(dict(type=HoverTool))
hover.tooltips = {
    "Value": "@x, @y",
    }
show(fig)

**Potential edX question:** Which column has the most transparent squares in the plot?

**Answer**: Column 1
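
One way to check this answer without looking at the plot is to average the alpha values in each column. This is a small sketch that rebuilds the same grid and alphas as above; `mean_alpha_by_column` and `most_transparent_column` are names introduced here for illustration.

```python
import numpy as np
from itertools import product

plot_values = [1, 2, 3, 4, 5]
grid = list(product(plot_values, plot_values))
xs, ys = zip(*grid)
alphas = np.linspace(0, 1, len(grid))

# Average the alpha values of the five squares that share each x coordinate.
mean_alpha_by_column = {
    x: float(np.mean([a for a, gx in zip(alphas, xs) if gx == x]))
    for x in plot_values
}

# The column with the lowest mean alpha is the most transparent one.
most_transparent_column = min(mean_alpha_by_column, key=mean_alpha_by_column.get)
print(most_transparent_column)  # 1
```

Because `product` varies the second value fastest, the first five alphas (the smallest ones) all belong to x = 1.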

### Exercise 2

In this exercise, we will create the names and colors we will use to plot the correlation matrix of whisky flavors. Later, we will also use these colors to plot each distillery geographically.

#### Instructions 
- Create a dictionary `region_colors` with `regions` as keys and `cluster_colors` as values.
- Print `region_colors`.

In [6]:
region_colors = {'Speyside': 'red', 'Highlands': 'orange', 'Lowlands': 'green', 'Islands': 'blue', 'Campbelltown': 'purple', 'Islay': 'gray'}
print(region_colors)
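
The instructions describe building the dictionary from two sequences, `regions` and `cluster_colors`. Neither is defined in this notebook, so the sketch below assumes illustrative values for both, chosen to reproduce the dictionary above.

```python
# Assumed inputs: these lists are not defined in the notebook and are
# chosen here only so that the result matches the dictionary above.
regions = ["Speyside", "Highlands", "Lowlands", "Islands", "Campbelltown", "Islay"]
cluster_colors = ["red", "orange", "green", "blue", "purple", "gray"]

# zip pairs each region with the color at the same position,
# and dict turns those pairs into key/value entries.
region_colors = dict(zip(regions, cluster_colors))
print(region_colors)
```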

### Exercise 3

`correlations` is a two-dimensional `np.array` whose rows and columns both correspond to distilleries and whose elements give the flavor correlation of each row/column pair. In this exercise, we will define a list `correlation_colors`, with `string` values giving the color used to plot each distillery pair. Low correlations will be white; high correlations will take a distinct group color if the two distilleries are from the same group, and light gray otherwise.

#### Instructions

- Edit the code to define `correlation_colors` for each distillery pair to have input `'white'` if their correlation is less than 0.7.
- `whisky` is a `pandas` dataframe, and `Group` is a column consisting of distillery group memberships. For distillery pairs with correlation greater than 0.7, if they share the same whisky group, use the corresponding color from `cluster_colors`. Otherwise, the `correlation_colors` value for that distillery pair will be defined as `'lightgray'`.
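
The coloring rule in these instructions can be tried on a toy example first. The sketch below uses made-up data (three distilleries, two groups); `toy_correlations`, `toy_groups`, `toy_cluster_colors`, and `pair_color` are hypothetical names and do not come from the whisky dataset.

```python
import numpy as np

# Hypothetical toy inputs: three distilleries in two groups.
toy_correlations = np.array([
    [1.0, 0.8, 0.9],
    [0.8, 1.0, 0.2],
    [0.9, 0.2, 1.0],
])
toy_groups = [0, 0, 1]
toy_cluster_colors = ["red", "blue"]

def pair_color(i, j):
    """Return the plotting color for the distillery pair (i, j)."""
    if toy_correlations[i, j] < 0.7:
        return "white"                             # low correlation
    if toy_groups[i] == toy_groups[j]:
        return toy_cluster_colors[toy_groups[i]]   # same group
    return "lightgray"                             # high correlation, different groups

toy_colors = [pair_color(i, j) for i in range(3) for j in range(3)]
print(toy_colors)
```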

In [7]:
num_groups = whisky["Group"].nunique()
print(f"Number of unique groups: {num_groups}")

Number of unique groups: 6


In [8]:
# Get the number of unique groups
num_groups = whisky["Group"].nunique()

# Define or expand cluster_colors to match the number of groups
cluster_colors = ["#e41a1c", "#377eb8", "#4daf4a", "#984ea3", "#ff7f00"]
if num_groups > len(cluster_colors):
    # Extend the color palette if necessary
    additional_colors = ["#a65628", "#f781bf", "#999999", "#66c2a5", "#fc8d62", "#8da0cb", "#e78ac3", "#a6d854", "#ffd92f"]
    cluster_colors.extend(additional_colors[:num_groups - len(cluster_colors)])

# Now proceed with building correlation_colors using cluster_colors
correlation_colors = [
    ["white" if correlations[i, j] < 0.7 else 
     (cluster_colors[whisky["Group"].iloc[i]] if whisky["Group"].iloc[i] == whisky["Group"].iloc[j] else "lightgray")
     for j in range(len(correlations))]
    for i in range(len(correlations))
]

# Flatten correlation_colors for use in ColumnDataSource
color_list = [color for row in correlation_colors for color in row]

# Prepare ColumnDataSource
source = ColumnDataSource(data={
    'x': np.repeat(range(len(correlations)), len(correlations)),
    'y': np.tile(range(len(correlations)), len(correlations)),
    'correlations': correlations.flatten().tolist(),
    'color': color_list
})
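
The `x`, `y`, and `correlations` columns only line up if `np.repeat`, `np.tile`, and `flatten()` all walk the matrix in the same order. A minimal check on a hypothetical 2x2 matrix (`toy` is an illustrative name) suggests that they do:

```python
import numpy as np

# Hypothetical 2x2 matrix standing in for the correlation matrix.
toy = np.array([[0.1, 0.2],
                [0.3, 0.4]])
n = len(toy)

x = np.repeat(range(n), n)   # row index of each flattened element
y = np.tile(range(n), n)     # column index of each flattened element
values = toy.flatten()       # row-major order by default

# Each flattened value matches the matrix entry at (row x, column y).
aligned = all(toy[xi, yi] == v for xi, yi, v in zip(x, y, values))

print(x.tolist(), y.tolist(), values.tolist())
print(aligned)  # True
```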


### Exercise 4

In this exercise, we will edit the given code to make an interactive grid of the correlations among distillery pairs based on the quantities found in previous exercises. Most plotting specifications are made by editing `ColumnDataSource`, a `bokeh` structure used for defining interactive plotting inputs. The rest of the plotting code is already complete.

#### Instructions 

- `correlation_colors` is a list of `string` colors for each pair of distilleries. Set this as `color` in `ColumnDataSource`.
- Define `correlations` in `source` using `correlations` from the previous exercise. To convert `correlations` from a `np.array` to a `list`, use the `flatten()` method. This correlation coefficient will be used to define both the color transparency as well as the hover text for each square.

In [17]:
# Create a Bokeh figure for the correlation grid plot
p = figure(title="Distillery Correlations",
           x_axis_location="above", 
           tools="hover,save,pan,box_zoom,reset",
           x_range=(-0.5, len(correlations) - 0.5), 
           y_range=(-0.5, len(correlations) - 0.5),
           width=600, height=600)

# Plot the grid of rectangles with the colors and transparency based on correlation values
p.rect('x', 'y', 0.9, 0.9, source=source, color='color', alpha='correlations')

# Configure hover tool to display the correlation coefficient
hover = p.select_one(HoverTool)
hover.tooltips = [("Correlation", "@correlations")]

# Display the plot
show(p)
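
Correlation coefficients can range from -1 to 1, while opacity is normally expressed on a scale from 0 to 1. If the data contain negative correlations, one option is to clip the values before using them as `alpha`. This is a sketch under that assumption, using made-up numbers; `toy_correlations` and `alpha_values` are illustrative names, and the clipping step is not part of the original exercise.

```python
import numpy as np

# Hypothetical correlation values, including negative ones.
toy_correlations = np.array([[1.0, -0.3],
                             [-0.3, 1.0]])

# Clip to [0, 1] so every value is a valid opacity. Negative
# correlations become fully transparent.
alpha_values = np.clip(toy_correlations, 0, 1).flatten().tolist()
print(alpha_values)  # [1.0, 0.0, 0.0, 1.0]
```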


### Exercise 5

In this exercise, we give a demonstration of plotting geographic points.

#### Instructions 

- Run the following code, to be adapted in the next section. Compare this code to that used in plotting the distillery correlations.

In [9]:
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.io import output_notebook
import pandas as pd

# Display Bokeh plots inline in the notebook
output_notebook()

# Sample data - replace this with the actual distillery dataset with geographic data
data = {
    'distillery': ['Distillery A', 'Distillery B', 'Distillery C', 'Distillery D'],
    'latitude': [56.4907, 55.9533, 57.1497, 55.8651],
    'longitude': [-4.2026, -3.1883, -2.0943, -4.2576],
    'group': ['Group 1', 'Group 2', 'Group 1', 'Group 3']
}

# Convert data to DataFrame
distillery_data = pd.DataFrame(data)

# Define colors for each group
group_colors = {'Group 1': '#1f78b4', 'Group 2': '#33a02c', 'Group 3': '#e31a1c'}
distillery_data['color'] = distillery_data['group'].map(group_colors)

# Create ColumnDataSource for Bokeh plotting
source = ColumnDataSource(data={
    'x': distillery_data['longitude'],
    'y': distillery_data['latitude'],
    'name': distillery_data['distillery'],
    'group': distillery_data['group'],
    'color': distillery_data['color']
})

# Set up the plot with interactive tools
p = figure(title="Geographic Distribution of Distilleries",
           x_axis_label="Longitude", y_axis_label="Latitude",
           tools="pan,zoom_in,zoom_out,reset,save",  # hover is added separately below
           width=700, height=500)

# Plot the geographic points
p.scatter('x', 'y', size=10, color='color', source=source, legend_field='group', fill_alpha=0.6)

# Enhance interactivity with hover information
hover = HoverTool()
hover.tooltips = [
    ("Distillery", "@name"),
    ("Group", "@group"),
    ("(Lat, Long)", "(@y, @x)")
]
p.add_tools(hover)

# Customize the legend
p.legend.title = "Distillery Groups"
p.legend.location = "top_left"

# Display the interactive plot
show(p)




**Potential edX question:** What is the location of the blue point in this plot?

**Answer**: (1,2)

### Exercise 6

In this exercise, we will define a function `location_plot(title, colors)` that takes a string `title` and a list of colors corresponding to each distillery and outputs a Bokeh plot of each distillery by latitude and longitude. It will also display the distillery name, latitude, and longitude as hover text.

#### Instructions 

- Adapt the given code beginning with the first comment and ending with `show(fig)` to create the function `location_plot()`, as described above.
- `Region` is a column in the `pandas` dataframe `whisky`, containing the regional group membership for each distillery. Make a list consisting of the value of `region_colors` for each distillery, and store this list as `region_cols`.
- Use `location_plot` to plot each distillery, colored by its regional grouping.

In [10]:
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, HoverTool
from typing import List
import pandas as pd

# Sample data setup
# Define region colors for each unique region
region_colors = {
    'Speyside': '#1f78b4',
    'Highlands': '#33a02c',
    'Islay': '#e31a1c',
    'Lowlands': '#ff7f00',
    'Campbeltown': '#6a3d9a'
}

# Sample whisky data (replace with actual data)
data = {
    'distillery': ['Distillery A', 'Distillery B', 'Distillery C', 'Distillery D'],
    'latitude': [56.4907, 55.9533, 57.1497, 55.8651],
    'longitude': [-4.2026, -3.1883, -2.0943, -4.2576],
    'Region': ['Speyside', 'Highlands', 'Islay', 'Lowlands']
}
whisky = pd.DataFrame(data)

# Create the location_plot function
def location_plot(title: str, colors: List[str]):
    # Prepare ColumnDataSource with latitude, longitude, and distillery info
    source = ColumnDataSource(data={
        'x': whisky['longitude'],
        'y': whisky['latitude'],
        'name': whisky['distillery'],
        'latitude': whisky['latitude'],
        'longitude': whisky['longitude'],
        'color': colors
    })
    
    # Set up the Bokeh plot
    fig = figure(title=title, 
                 x_axis_label="Longitude", 
                 y_axis_label="Latitude", 
                 tools="pan,zoom_in,zoom_out,reset,save",  # hover is added separately below
                 width=700, height=500)
    
    # Plot each distillery as a circle
    fig.scatter('x', 'y', size=10, color='color', source=source, fill_alpha=0.6)
    
    # Add hover tool for displaying additional information
    hover = HoverTool()
    hover.tooltips = [
        ("Distillery", "@name"),
        ("Latitude", "@latitude"),
        ("Longitude", "@longitude")
    ]
    fig.add_tools(hover)
    
    return fig

# Generate the region_cols list based on the region colors
region_cols = [region_colors[region] for region in whisky["Region"]]

# Use location_plot to create the plot
plot = location_plot("Distillery Locations by Region", region_cols)

# Display the plot
show(plot)
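
The list comprehension for `region_cols` raises a `KeyError` if a region in the data has no entry in `region_colors`. One possible safeguard is `dict.get` with a fallback color. The sketch below uses a made-up dataframe `sample`, and the fallback `"lightgray"` is an assumed placeholder.

```python
import pandas as pd

region_colors = {"Speyside": "#1f78b4", "Highlands": "#33a02c"}

# Hypothetical sample in which one region ("Islands") is missing from the mapping.
sample = pd.DataFrame({"Region": ["Speyside", "Islands", "Highlands"]})

# dict.get returns the fallback instead of raising KeyError.
fallback_color = "lightgray"  # assumed placeholder color
region_cols = [region_colors.get(region, fallback_color) for region in sample["Region"]]
print(region_cols)  # ['#1f78b4', 'lightgray', '#33a02c']
```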




### Exercise 7 

In this exercise, we will use this function to plot each distillery, colored by region and taste coclustering classification, respectively.

#### Instructions 
- Create the list `region_cols` consisting of the color in `region_colors` that corresponds to each whisky in `whisky.Region`.
- Similarly, create a list `classification_cols` consisting of the color in `cluster_colors` that corresponds to each cluster membership in `whisky.Group`.
- Create two interactive plots of distilleries, one using `region_cols` and the other with colors defined by `classification_cols`. How well do the coclustering groupings match the regional groupings?

In [20]:
from bokeh.plotting import show
import pandas as pd

# Sample data setup - replace these with your actual data if needed
# Define colors for each region and group
region_colors = {
    'Speyside': '#1f78b4',
    'Highlands': '#33a02c',
    'Islay': '#e31a1c',
    'Lowlands': '#ff7f00',
    'Campbeltown': '#6a3d9a'
}
# Example color mapping for coclustering groups
cluster_colors = ["#a6cee3", "#b2df8a", "#fb9a99", "#fdbf6f", "#cab2d6"]

# Sample whisky data (replace with actual data)
data = {
    'distillery': ['Distillery A', 'Distillery B', 'Distillery C', 'Distillery D'],
    'latitude': [56.4907, 55.9533, 57.1497, 55.8651],
    'longitude': [-4.2026, -3.1883, -2.0943, -4.2576],
    'Region': ['Speyside', 'Highlands', 'Islay', 'Lowlands'],
    'Group': [0, 1, 0, 2]  # Example group classification
}
whisky = pd.DataFrame(data)

# Create the list `region_cols` using region_colors
region_cols = [region_colors[region] for region in whisky["Region"]]

# Create the list `classification_cols` using cluster_colors for each coclustering group
classification_cols = [cluster_colors[group] for group in whisky["Group"]]

# Use the location_plot function from the previous exercise to create the two plots
# Plot distilleries colored by region
region_plot = location_plot("Distillery Locations by Region", region_cols)
show(region_plot)

# Plot distilleries colored by coclustering classification
classification_plot = location_plot("Distillery Locations by Coclustering Classification", classification_cols)
show(classification_plot)
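
Beyond comparing the two plots by eye, a cross-tabulation gives a numeric view of how the coclustering groups line up with the regions. This is a sketch on made-up data; `sample` and `counts` are illustrative names, and the real `whisky` dataframe would be used in their place.

```python
import pandas as pd

# Hypothetical sample; replace with the real whisky dataframe.
sample = pd.DataFrame({
    "Region": ["Speyside", "Speyside", "Islay", "Islay", "Highlands"],
    "Group": [0, 0, 1, 1, 0],
})

# Rows are regions, columns are coclustering groups, and cells are counts.
counts = pd.crosstab(sample["Region"], sample["Group"])
print(counts)
```

If each row of the table has most of its counts in a single column, the two groupings agree closely; counts spread across many columns suggest a weaker match.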




