![Practicum AI Logo image](https://github.com/PracticumAI/practicumai.github.io/blob/main/images/logo/PracticumAI_logo_250x50.png?raw=true)  <img src='images/05/data_visualization.png' align='right' width=50>

# *Practicum AI Data*: Data Visualization Introduction

This exercise is adapted from Mario Döbler & Tim Großmann (2020) <i>The Data Visualization Workshop</i>, published by <a href="https://www.packtpub.com/product/the-data-visualization-workshop/9781800568846">Packt Publishers</a>, and its accompanying <a href="https://github.com/PacktWorkshops/The-Data-Visualization-Workshop">GitHub repository</a>.
***

In this notebook, we will start to explore the fundamental concepts of data visualization. Data visualization is a powerful tool that brings data to life by presenting it in visual formats that are easy to comprehend and interpret.


## Objectives

By the end of this notebook, you will be able to:

1. Understand why data visualization matters and learn about different types of plots commonly used.
2. Discover how to create interactive and engaging data visualizations.

## 1. Importance of Data Visualization

**Data visualization** is the process of representing information, data, or numerical values visually through charts, graphs, maps, and other visual elements. By transforming complex datasets into easily understandable and visually appealing representations, data visualization enables users to grasp patterns, trends, and insights that might be difficult to discern from raw data alone. This visual approach allows for quicker understanding and more effective communication of the underlying information, making it a valuable tool for data analysis, decision-making, and storytelling in various fields, including business, science, finance, and more.

**Why is data visualization so important?**

Data visualization is essential as it brings numerous benefits to data analysis and decision-making:

* *Better Understanding*: Visualizations help people grasp complex data easily, revealing patterns and trends that might be hidden in raw data.

* *Spotting Patterns*: Visualizations quickly identify patterns and outliers, aiding analysts in finding critical insights and anomalies.

* *Effective Communication*: By presenting data visually, it becomes easier to communicate insights to a broader audience, even those without data analysis expertise.

* *Informed Decision-Making*: Visualizations allow decision-makers to assess multiple variables simultaneously, leading to more data-driven and informed decisions.

* *Interactive Exploration*: Interactive visualizations enable users to explore data from different angles, promoting deeper understanding and effective information extraction.

In conclusion, data visualization unlocks data's true potential, simplifying understanding, supporting decision-making, and fostering effective communication.

## 2. Different Types of Plot

### 2.1 Types of Data

Data can be broadly categorized into two main types: categorical/qualitative data and numerical/quantitative data.

1. **Categorical/Qualitative Data**
   - Categorical data describes characteristics or attributes and cannot be measured on a numerical scale.
   - It includes nominal data, where categories have no inherent order (e.g., colors, gender).
   - It also includes ordinal data, where categories have a specific order (e.g., survey ratings like "low," "medium," and "high").
   
2. **Numerical/Quantitative Data**
   - Numerical data represent measurable quantities and are expressed on a numerical scale.
   - It can be further divided into discrete data, where values are specific and separate (e.g., the number of children in a family).
   - It also includes continuous data, where values can take any value within a range (e.g., temperature, height).
   
Considering the nature of the data is essential for choosing appropriate statistical measures and visualizations to analyze and communicate the information effectively. Additionally, data can have **temporal** or **spatial** characteristics, which further influence which visual representations best convey the insights and relationships within the dataset.
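
As a rough illustration of these categories, the sketch below builds a small made-up table with pandas. The column names and values are hypothetical; the point is only how nominal, ordinal, discrete, and continuous data might be declared so that later analysis and plotting code can treat each kind appropriately.

```python
import pandas as pd

# Hypothetical survey data, invented purely for illustration
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],      # nominal: no inherent order
    "rating": ["low", "high", "medium", "low"],    # ordinal: ordered categories
    "children": [0, 2, 1, 3],                      # discrete: separate whole values
    "height_cm": [162.5, 180.1, 175.0, 168.3],     # continuous: any value in a range
})

# Nominal data: an unordered categorical
df["color"] = df["color"].astype("category")

# Ordinal data: an ordered categorical with an explicit order
df["rating"] = pd.Categorical(
    df["rating"], categories=["low", "medium", "high"], ordered=True
)

print(df.dtypes)
print("Highest rating:", df["rating"].max())
```

Because `rating` is declared as ordered, comparisons such as `max()` respect the "low" < "medium" < "high" order instead of sorting alphabetically.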

### 2.2 Types of Plot

In data visualization, we have a wide range of plot types to choose from, such as comparison plots, relation plots, composition plots, distribution plots, and geoplots.


#### 2.2.1 Comparison Plots

Comparison plots are useful for comparing multiple variables or tracking variables over time. Line charts effectively visualize variables over time. Bar charts, also known as column charts, are excellent for comparing items, while vertical bar charts are suitable for a smaller time period with fewer than 10 data points. Additionally, radar charts, also called spider plots, are a great option for visualizing multiple variables across multiple groups.

**(1) Line Charts**

Line charts are powerful tools for visualizing quantitative values over a continuous time period. They present information as a series of data points connected by straight-line segments, making it easy to track trends and changes over time.

In a line chart:

* The y-axis represents the value being measured.
* The x-axis represents the timescale.

Line charts are incredibly versatile and find great use in various scenarios:

* Comparing Multiple Variables: Line charts are perfect for comparing multiple variables and understanding how they evolve over time.

* Visualizing Trends: They excel at visualizing trends for both single and multiple variables, providing valuable insights into data patterns.

* Suitable for Extended Time Periods: Line charts are ideal when dealing with datasets that span many time periods, providing a comprehensive view of data trends.

However, for smaller time periods with fewer than 10 data points, vertical bar charts might be a better alternative.

<div style="text-align:center">
    <p style="font-weight: bold; font-size:12px; text-align:center">Line Chart</p>
    <img src="images/05/line_charts.png" width="150">
    <p style="font-style:italic; font-size:12px; text-align:center">Image Source: <a href="https://www.flaticon.com/free-icon/line-graph_1270360" style="text-decoration:none; color:inherit;">flaticon</a></p>
</div>

Design Practices for Line Charts:

* Limit the number of lines in a chart.
* Adjust the scale for better visibility of trends.
* Use a legend for plots with multiple variables to describe each variable.
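
The following is a minimal sketch of a line chart that follows these practices, assuming a recent Bokeh release (3.x). The data, labels, and the helper name `make_line_chart` are made up for illustration.

```python
from bokeh.plotting import figure

def make_line_chart():
    """Build a two-line chart with a legend (all values are made up)."""
    years = [2015, 2016, 2017, 2018, 2019, 2020]
    product_a = [12, 15, 14, 18, 21, 24]
    product_b = [8, 9, 13, 12, 16, 15]

    p = figure(width=500, height=300, title="Sales over time",
               x_axis_label="Year", y_axis_label="Units sold")
    # One line per variable; legend_label tells the reader which is which
    p.line(years, product_a, line_width=2, color="navy", legend_label="Product A")
    p.line(years, product_b, line_width=2, color="firebrick", legend_label="Product B")
    p.legend.location = "top_left"
    return p

line_chart = make_line_chart()
```

In a notebook, calling `show(line_chart)` from `bokeh.io` after `output_notebook()` should display the chart below the cell.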

**(2) Bar Charts**

Bar charts are simple and effective visualizations that use the length of bars to represent values. There are two types: vertical bar charts and horizontal bar charts.

Bar charts are commonly used to compare numerical values across different categories. Vertical bar charts can also show how a single variable changes over time.

When using bar charts:

* Avoid confusing them with histograms. Bar charts compare different variables or categories, while histograms show the distribution of a single variable.
* Don't use bar charts to display central tendencies among groups or categories. Instead, use box plots or violin plots to show statistical measures or distributions in those cases.

<div style="text-align:center">
    <p style="font-weight: bold; font-size:12px; text-align:center">Bar Chart</p>
    <img src="images/05/bar_chart_vertical.png" width="150"><img src="images/05/bar_chart_horizontal.png" width="150">
    <p style="font-style:italic; font-size:12px; text-align:center">Image Source: <a href="https://icon-icons.com/icon/bar-chart/183416" style="text-decoration:none; color:inherit;">Icon-Icons</a></p>
</div>

Design Practices for Bar Charts:

* Always start the numerical axis from zero to avoid confusion and exaggeration of differences.
* Use horizontal labels when there are only a few bars and the chart looks clear.
* Rotate the labels to a different angle if the horizontal presentation becomes too crowded.
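
These practices can be sketched in Bokeh roughly as follows, assuming a recent Bokeh release (3.x). The categories, counts, and the helper name `make_bar_chart` are invented for this example.

```python
from math import pi
from bokeh.plotting import figure

def make_bar_chart():
    """Vertical bar chart over categories (values are made up)."""
    fruits = ["Apples", "Pears", "Nectarines", "Plums", "Grapes", "Strawberries"]
    counts = [5, 3, 4, 2, 4, 6]

    # A list passed as x_range makes the x-axis categorical
    p = figure(x_range=fruits, width=500, height=300, title="Fruit counts")
    p.vbar(x=fruits, top=counts, width=0.8, color="navy")

    p.y_range.start = 0                       # start the numerical axis at zero
    p.xaxis.major_label_orientation = pi / 4  # rotate crowded labels by 45 degrees
    return p

bar_chart = make_bar_chart()
```

With only a few bars, the rotation line can simply be dropped so that the labels stay horizontal.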

**(3) Radar Charts**

Radar charts, also called spider or web charts, show multiple variables on their axes, creating a polygon shape. All axes start from the center with equal distances between them and use the same scale.

Radar charts are perfect for comparing multiple quantitative variables within one group or across multiple groups. They are also great for visualizing which variables have high or low scores in a dataset, making them useful for assessing performance.

<div style="text-align:center">
    <p style="font-weight: bold; font-size:12px; text-align:center">Radar Chart</p>
    <img src="images/05/radar_chart.png" width="150">
    <p style="font-style:italic; font-size:12px; text-align:center">Image Source: <a href="https://www.flaticon.com/free-icon/analytic-graph_5932724" style="text-decoration:none; color:inherit;">flaticon</a></p>
</div>

Design Practices for Radar Charts:

* Limit radar charts to show 10 factors or fewer for easier readability.
* Use faceting, that is, displaying each variable in a separate plot, especially when there are multiple variables or groups. This keeps the visualization clear.
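
To make the geometry concrete, here is a small sketch of how the polygon for one group could be computed: each axis gets an equal share of the full circle, and all axes share one scale. The function name `radar_polygon` and the scores are hypothetical.

```python
import math

def radar_polygon(values, max_value):
    """Return (x, y) vertices of the radar polygon for one group.

    Axis i points at angle 2*pi*i/n, and every axis shares the same scale,
    so each value is divided by max_value to get a radius between 0 and 1.
    """
    n = len(values)
    points = []
    for i, value in enumerate(values):
        angle = 2 * math.pi * i / n      # equal angular spacing between axes
        radius = value / max_value       # same scale on every axis
        points.append((radius * math.cos(angle), radius * math.sin(angle)))
    return points

# Hypothetical scores for one person on five skills, each out of 10
scores = [8, 6, 9, 4, 7]
vertices = radar_polygon(scores, max_value=10)
for (x, y), score in zip(vertices, scores):
    print(f"score={score}: x={x:+.2f}, y={y:+.2f}")
```

Connecting the vertices in order, and closing the shape back to the first one, gives the polygon that a plotting library would draw.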


#### 2.2.2 Relation Plots

Relation plots are great for showing relationships between variables. A scatter plot is used to visualize the correlation between two variables for one or multiple groups. If you want to show relationships involving three variables, a bubble plot is used, where the dot size represents the additional third variable. Heatmaps are excellent for revealing patterns or correlations between two qualitative variables. For displaying the correlation among multiple variables, a correlogram is a perfect visualization.

**(1) Scatter Plots**

Scatter plots are graphs that show data points for two numerical variables, with one variable on each axis.

Scatter plots help us see if there is a relationship (correlation) between two variables. They can also display the relationship between multiple groups or categories by using different colors. Bubble plots, a type of scatter plot, are great for visualizing correlations involving a third variable.

<div style="text-align:center">
    <p style="font-weight: bold; font-size:12px; text-align:center">Scatter Plot</p>
    <img src="images/05/scatter_plot.png" width="150">
    <p style="font-style:italic; font-size:12px; text-align:center">Image Source: <a href="https://www.flaticon.com/free-icon/scatter-plot_7665284" style="text-decoration:none; color:inherit;">flaticon</a></p>
</div>


Design Practices for Scatter Plots:

* Make sure both axes start at zero to show the data accurately.
* Use different colors for data points in scatter plots, and avoid using symbols when you have multiple groups or categories.

**(2) Bubble Plots**

A bubble plot is a variation of a scatter plot that introduces a third numerical variable. In this plot, the size of the dots represents the value of the third variable. The larger the dot, the higher the value. To understand the exact numerical value, a legend is provided, linking the dot size to its corresponding measurement.

Bubble plots are useful for displaying the correlation between three variables.

<div style="text-align:center">
    <p style="font-weight: bold; font-size:12px; text-align:center">Bubble Plot</p>
    <img src="images/05/bubble_graph.png" width="150">
    <p style="font-style:italic; font-size:12px; text-align:center">Image Source: <a href="https://www.flaticon.com/free-icon/bubble-graph_5712030" style="text-decoration:none; color:inherit;">flaticon</a></p>
</div>

Design Practices for Bubble Plots:

* The same design practices used for scatter plots apply to bubble plots as well.
* Avoid using bubble plots for very large datasets, as having too many bubbles can make the chart challenging to read.
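
One detail worth sketching is how a value is turned into a dot size. The example below assumes a common design choice that the text above does not prescribe: the bubble's area, rather than its diameter, is made proportional to the value. The function name `bubble_sizes` and the data are hypothetical.

```python
import math

def bubble_sizes(values, max_size=40):
    """Map values to marker diameters so that bubble AREA is proportional to value."""
    largest = max(values)
    # Area grows with the square of the diameter, hence the square root
    return [max_size * math.sqrt(v / largest) for v in values]

population_millions = [5, 20, 80]   # hypothetical third variable
sizes = bubble_sizes(population_millions)
print([round(s, 1) for s in sizes])   # prints [10.0, 20.0, 40.0]
```

A list like `sizes` could then be passed as the `size` argument of a scatter glyph, with a legend explaining which diameter corresponds to which value.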

**(3) Correlograms**

A correlogram combines scatter plots and histograms to show the relationship between numerical variables. It helps visualize how variables are related to each other. The diagonal of the correlation matrix shows the distribution of each variable with histograms. 

Correlograms are valuable for exploring data and understanding how different variables correlate with each other.

<div style="text-align:center">
    <p style="font-weight: bold; font-size:12px; text-align:center">Correlogram</p>
    <img src="images/05/correlogram.png" width="400">
    <p style="font-style:italic; font-size:12px; text-align:center">Image Source: <a href="https://python-graph-gallery.com/111-custom-correlogram/" style="text-decoration:none; color:inherit;">Python Graph Gallery</a></p>
</div>

Design Practices for Correlograms:

* Always start both axes at zero to show the data accurately.
* Use different colors for data points and avoid using symbols when dealing with scatter plots that have multiple groups or categories.

**(4) Heatmaps**

A heatmap uses colors to show values in a matrix. It's great for visualizing multivariate data, where rows and columns represent categories, and colors represent a numerical or categorical variable.

Heatmaps are useful for finding patterns in complex data with multiple variables. They help to analyze and understand relationships within the dataset.

<div style="text-align:center">
    <p style="font-weight: bold; font-size:12px; text-align:center">Heatmap</p>
    <img src="images/05/heatmap.png" width="150">
    <p style="font-style:italic; font-size:12px; text-align:center">Image Source: <a href="https://icon-library.com/icon/heatmap-icon-22.html" style="text-decoration:none; color:inherit;">Free Icons Library</a></p>
</div>

Design Practice for Heatmaps:

* Choose colors and contrasts that are easily visible to individuals with vision problems to ensure your plots are more inclusive.
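
As a hedged sketch of this practice in Bokeh (assuming a 3.x release), the example below colors a small matrix with the Viridis palette, which was designed to remain readable under common forms of color-vision deficiency. The data and the helper name `make_heatmap` are made up.

```python
from bokeh.models import ColumnDataSource
from bokeh.palettes import Viridis256
from bokeh.plotting import figure
from bokeh.transform import linear_cmap

def make_heatmap():
    """Heatmap of made-up values for every (day, period) pair."""
    days = ["Mon", "Tue", "Wed"]
    periods = ["Morning", "Afternoon", "Evening"]
    values = [[3, 7, 2], [5, 9, 4], [1, 6, 8]]   # values[i][j]: days[i], periods[j]

    # Flatten the matrix into one row per cell
    source = ColumnDataSource(data={
        "day": [d for d in days for _ in periods],
        "period": [t for _ in days for t in periods],
        "value": [v for row in values for v in row],
    })

    p = figure(x_range=days, y_range=periods, width=400, height=300,
               title="Activity by day and period")
    # Map each value to a color between the palette's darkest and brightest ends
    mapper = linear_cmap("value", Viridis256, low=1, high=9)
    p.rect(x="day", y="period", width=1, height=1, source=source,
           fill_color=mapper, line_color=None)
    return p, source

heatmap, heatmap_source = make_heatmap()
```

Swapping `Viridis256` for another palette from `bokeh.palettes` is a one-line change, which makes it easy to compare how different color schemes read.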


#### 2.2.3 Composition Plots

Composition plots are used to show parts of a whole. For static data, you can use pie charts, stacked bar charts, or Venn diagrams. Pie charts and donut charts are good for showing the proportions and percentages of different groups. If you need to add another dimension, use stacked bar charts. Venn diagrams are best for visualizing overlapping groups, where each group is represented by a circle. For data that changes over time, you can use either stacked bar charts or stacked area charts.

**(1) Pie Charts**

Pie charts show numerical proportions by dividing a circle into slices. Each slice represents a proportion of a category, with the full circle representing 100%. While pie charts are commonly used to compare parts of a whole, it is often easier for people to compare bars in bar charts or stacked bar charts instead.

<div style="text-align:center">
    <p style="font-weight: bold; font-size:12px; text-align:center">Pie Chart</p>
    <img src="images/05/pie_chart.png" width="150">
    <p style="font-style:italic; font-size:12px; text-align:center">Image Source: <a href="https://www.freepik.com/free-icon/pie-chart_14008990.htm" style="text-decoration:none; color:inherit;">freepik</a></p>
</div>

Design Practices for Pie Charts:

* Arrange the slices based on their size, either from smallest to largest or largest to smallest, in a clockwise or counterclockwise manner.
* Use different colors for each slice to make them easily distinguishable and clear.
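
The arithmetic behind a pie chart is simple enough to sketch directly: each category's share of the total becomes the same share of the full circle. The helper `pie_slices` and the counts below are hypothetical, and the slices are ordered from largest to smallest as suggested above.

```python
import math

def pie_slices(counts):
    """Return (label, share, start_angle, end_angle) tuples, largest slice first."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    slices, start = [], 0.0
    for label, value in ordered:
        share = value / total
        end = start + share * 2 * math.pi   # the full circle is 100%
        slices.append((label, share, start, end))
        start = end
    return slices

groups = {"A": 15, "B": 50, "C": 25, "D": 10}   # made-up counts
for label, share, start, end in pie_slices(groups):
    print(f"{label}: {share:.0%} from {start:.2f} to {end:.2f} rad")
```

Start and end angles like these are the kind of input that wedge-style glyphs, such as Bokeh's `wedge`, take when drawing each slice.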

**(2) Donut Charts**

A donut chart is an alternative to a pie chart. It is easier to compare the size of slices in a donut chart as readers focus on the length of the arcs instead of the area. Donut charts are more space-efficient because the center is cut out, allowing for additional information or subgroup divisions.

<div style="text-align:center">
    <p style="font-weight: bold; font-size:12px; text-align:center">Donut Chart</p>
    <img src="images/05/donut_chart.png" width="150">
    <p style="font-style:italic; font-size:12px; text-align:center">Image Source: <a href="https://www.flaticon.com/free-icon/pie-chart_3589902" style="text-decoration:none; color:inherit;">flaticon</a></p>
</div>

Design Practice for Donut Charts:

* Use the same color for subcategories as the main category but make them brighter or darker to distinguish them effectively. This helps create a clear and visually appealing visualization.

**(3) Stacked Bar Charts**

Stacked bar charts show how a category is divided into subcategories and the proportion of each subcategory in comparison to the whole category. You can compare the total amounts in each bar or show the percentage of each group. The latter is known as a 100% stacked bar chart and helps visualize relative differences between quantities in each group.

Use stacked bar charts to compare variables that have subcategories.

<div style="text-align:center">
    <p style="font-weight: bold; font-size:12px; text-align:center">Stacked Bar Chart</p>
    <img src="images/05/stacked_bar_chart.png" width="150">
    <p style="font-style:italic; font-size:12px; text-align:center">Image Source: <a href="https://icon-icons.com/icon/stacked-bar-chart/183431" style="text-decoration:none; color:inherit;">Icon-Icons</a></p>
</div>

Design Practices for Stacked Bar Charts:

* Use different colors for stacked bars to make them stand out visually.
* Leave enough space between the bars to avoid overcrowding. The recommended spacing between each bar is half the width of a bar.
* Organize data alphabetically, sequentially, or by value to create a clear and consistent order for better understanding by your audience.
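
The difference between a regular and a 100% stacked bar chart comes down to one normalization step, sketched below. The function name `to_percent_stack` and the sales figures are invented for illustration.

```python
def to_percent_stack(groups):
    """Convert absolute subcategory values into percentages of each bar's total."""
    result = {}
    for bar, parts in groups.items():
        total = sum(parts.values())
        result[bar] = {name: 100 * value / total for name, value in parts.items()}
    return result

# Hypothetical sales per quarter, split by region
sales = {
    "Q1": {"North": 30, "South": 50, "West": 20},
    "Q2": {"North": 45, "South": 30, "West": 75},
}
for quarter, shares in to_percent_stack(sales).items():
    print(quarter, {region: round(pct, 1) for region, pct in shares.items()})
```

After normalization every bar has the same height, so the chart emphasizes relative differences between subcategories rather than the totals.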

**(4) Stacked Area Charts**

Stacked area charts display trends for part-of-a-whole relationships. The values of multiple groups are represented by stacking individual area charts on top of one another. This chart allows for the analysis of both individual and overall trend information.

Stacked area charts are used to show trends over time that are part of a whole.

<div style="text-align:center">
    <p style="font-weight: bold; font-size:12px; text-align:center">Stacked Area Chart</p>
    <img src="images/05/area_chart.png" width="150">
    <p style="font-style:italic; font-size:12px; text-align:center">Image Source: <a href="https://icon-icons.com/icon/area-chart/183308" style="text-decoration:none; color:inherit;">Icon-Icons</a></p>
</div>

Design Practice for Stacked Area Charts:

* To enhance information visibility, use transparent colors. This allows you to analyze overlapping data and view the grid lines more easily.

**(5) Venn Diagrams**

Venn diagrams, also called set diagrams, illustrate all possible logical relations among a finite collection of distinct sets. Each set is represented by a circle, and the size of the circle indicates the importance of the group. The size of the overlapping areas represents the intersection between multiple groups.

Venn diagrams are used to show the overlaps between different sets.

<div style="text-align:center">
    <p style="font-weight: bold; font-size:12px; text-align:center">Venn Diagrams</p>
    <img src="images/05/venn_diagram.png" width="150">
    <p style="font-style:italic; font-size:12px; text-align:center">Image Source: <a href="https://www.flaticon.com/free-icon/venn-diagram_2106632" style="text-decoration:none; color:inherit;">flaticon</a></p>
</div>

Design Practice for Venn Diagrams:

* Avoid using Venn diagrams when you have more than three groups, as it can make the visualization challenging to understand.

#### 2.2.4 Distribution Plots

Distribution plots provide valuable insights into how your data is distributed. For a single variable, a histogram is an effective choice. For multiple variables, you can use either a box plot or a violin plot. The violin plot shows the densities of your variables, while the box plot displays the median, interquartile range, and range for each variable.

**(1) Histograms**

A histogram shows the distribution of a single numerical variable. Each bar represents the frequency for a specific interval, allowing you to see where values are concentrated and identify outliers. You can choose to plot the histogram with absolute frequency values or normalize it. When comparing distributions of multiple variables, use different colors for the bars.

Histograms provide insights into the distribution of a dataset, helping you understand the data's patterns and trends.

<div style="text-align:center">
    <p style="font-weight: bold; font-size:12px; text-align:center">Histogram</p>
    <img src="images/05/histogram.png" width="150">
    <p style="font-style:italic; font-size:12px; text-align:center">Image Source: <a href="https://www.flaticon.com/free-icons/histogram" style="text-decoration:none; color:inherit;">flaticon</a></p>
</div>

Design Practice for Histograms:

* Experiment with different numbers of bins (data intervals) for the histogram, as it can significantly impact the shape and representation of the data.
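
To see how strongly the bin count shapes the result, the sketch below bins the same made-up data three ways using a small hand-written function. The name `histogram` and the equal-width binning are choices made for this example; it assumes the values are not all identical.

```python
def histogram(values, bins):
    """Count values in `bins` equal-width intervals spanning min..max."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bin rather than a new one
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]   # made-up measurements
for bins in (2, 4, 8):
    print(bins, "bins:", histogram(data, bins))
```

With two bins the outlier at 9 is hidden inside a wide interval, while eight bins show both the peak around 3 and the gap before the outlier.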

**(2) Density Plots**

A density plot displays the distribution of a numerical variable. It is a variation of a histogram that uses kernel smoothing, resulting in smoother distributions. One advantage of density plots over histograms is their ability to better determine the distribution shape, as histograms' shape heavily depends on the number of bins (data intervals).

Use density plots to compare the distribution of several variables by plotting their densities on the same axis and using different colors. This allows for easy visual comparison of distributions.

<div style="text-align:center">
    <p style="font-weight: bold; font-size:12px; text-align:center">Density Plot</p>
    <img src="images/05/density.png" width="150">
    <p style="font-style:italic; font-size:12px; text-align:center">Image Source: <a href="https://www.flaticon.com/free-icon/density_9156326" style="text-decoration:none; color:inherit;">flaticon</a></p>
</div>

Design Practice for Density Plots:

* Use different and contrasting colors to plot the density of multiple variables. This helps make the visualization clear and easy to understand.
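
Kernel smoothing can be sketched in a few lines: every data point contributes a small bell curve, and the density is the average of those curves. The example below uses a Gaussian kernel with a hand-picked bandwidth, both of which are assumptions of this sketch; plotting libraries usually choose the bandwidth automatically.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function estimating the density of `data` with Gaussian kernels."""
    n = len(data)
    norm = 1 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum one bell curve per data point, centered on that point
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)

    return density

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]          # made-up measurements
density = gaussian_kde(data, bandwidth=0.8)    # bandwidth picked by hand here
grid = [i * 0.5 for i in range(0, 23)]         # 0.0, 0.5, ..., 11.0
curve = [density(x) for x in grid]
peak = grid[curve.index(max(curve))]
print("Density peaks near x =", peak)
```

A larger bandwidth gives a smoother but blurrier curve, playing a role similar to the number of bins in a histogram.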

**(3) Box Plots**

A box plot displays several statistical measurements. The box represents the interquartile range (IQR), extending from the lower to the upper quartile values of the data. The horizontal line within the box represents the median. The whiskers, extending from the boxes, indicate the variability outside the lower and upper quartiles. Additionally, data outliers can be shown as circles or diamonds beyond the end of the whiskers.

Box plots are used to compare statistical measures for multiple variables or groups, helping to understand the data distribution and variability.

<div style="text-align:center">
    <p style="font-weight: bold; font-size:12px; text-align:center">Box Plot</p>
    <img src="images/05/box_plot.png" width="150">
    <p style="font-style:italic; font-size:12px; text-align:center">Image Source: <a href="https://www.flaticon.com/free-icon/box-plot_6754360" style="text-decoration:none; color:inherit;">flaticon</a></p>
</div>
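
The numbers a box plot draws can be computed with the standard library, as sketched below. One assumption is labeled in the code: the whiskers are taken to reach the most extreme points within 1.5 times the IQR of the box, a common convention that the description above does not specify. The function name `box_plot_stats` and the sample are hypothetical.

```python
import statistics

def box_plot_stats(values, whisker=1.5):
    """Compute the numbers a box plot draws.

    Assumption: whiskers reach the most extreme points within
    `whisker` * IQR of the box (the common Tukey convention).
    """
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - whisker * iqr, q3 + whisker * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

stats = box_plot_stats([2, 4, 4, 5, 6, 7, 8, 9, 30])   # made-up sample
print(stats)
```

In this sample the value 30 lies beyond the upper fence, so it would be drawn as a separate outlier marker rather than stretching the whisker.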

**(4) Violin Plots**

Violin plots combine box plots and density plots, showing both statistical measures and data distribution. The thick black bar in the center represents the interquartile range, like the box in a box plot. The white dot indicates the median, and the density is visualized on both sides of the centerline.

Violin plots are used to compare statistical measures and data density for multiple variables or groups, providing a comprehensive understanding of the data distribution.

<div style="text-align:center">
    <p style="font-weight: bold; font-size:12px; text-align:center">Violin Plot</p>
    <img src="images/05/violin_plot.jpg" width="150">
    <p style="font-style:italic; font-size:12px; text-align:center">Image Source: <a href="https://synergycodes.com/glossary/what-is-violin-plot/" style="text-decoration:none; color:inherit;">synergy codes</a></p>
</div>

Design Practice for Violin Plots:

* Scale the axes correctly so that the data distribution is clearly visible and not flat. This ensures an accurate representation of the data's characteristics and patterns.

#### 2.2.5 Geoplots

Geospatial plots are excellent for visualizing geospatial data. Choropleth maps are useful for comparing quantitative values across different countries, states, or regions. If you need to display connections between various locations, connection maps are the preferred choice.

**(1) Dot Maps**

In a dot map, each dot represents a certain number of observations. The dots are of the same size and value, showing the magnitude of the data. You can use different colors or symbols to represent multiple categories or groups.

Dot maps are used to visualize geospatial data, making it easier to understand patterns across locations.

<div style="text-align:center">
    <p style="font-weight: bold; font-size:12px; text-align:center">Dot Map</p>
    <img src="images/05/dot_map.png" width="200">
    <p style="font-style:italic; font-size:12px; text-align:center">Image Source: <a href="https://datavizproject.com/data-type/dot-density-map/" style="text-decoration:none; color:inherit;">DVP</a></p>
</div>

Design Practices for Dot Maps:

* Avoid overcrowding the dot map with too many locations. Keep the map clear for better understanding.
* Choose a dot size and value that allows dots in dense areas to blend together, providing a clear impression of the spatial distribution.

**(2) Choropleth Maps**

In a choropleth map, each tile is colored to represent a specific variable. It shows how a variable changes across a geographic area. To make the map visually accurate, consider normalizing the data based on the map's area.

Choropleth maps are used to visualize geospatial data, showing how a variable varies across geographical regions, such as states or countries.

<div style="text-align:center">
    <p style="font-weight: bold; font-size:12px; text-align:center">Choropleth Map</p>
    <img src="images/05/choropleth_map.png" width="200">
    <p style="font-style:italic; font-size:12px; text-align:center">Image Source: <a href="https://www.storybench.org/how-to-build-an-interactive-choropleth-map-with-barely-any-code/" style="text-decoration:none; color:inherit;">storybench</a></p>
</div>

Design Practices for Choropleth Maps:

* Use darker colors for higher values, as they are perceived as having a higher magnitude.
* Limit the color gradation to around seven shades, as the human eye has difficulty distinguishing between too many colors.
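
Both the normalization by area and the limited number of shades can be sketched without any mapping library. The regions, figures, and the equal-width classing in `classify` are assumptions made for this example; other classing schemes, such as quantiles, are equally possible.

```python
def classify(values, n_classes=7):
    """Assign each value to one of `n_classes` equal-width classes (0 = lightest)."""
    low, high = min(values), max(values)
    width = (high - low) / n_classes
    return [min(int((v - low) / width), n_classes - 1) for v in values]

# Hypothetical regions: (name, population, area in km^2)
regions = [("A", 1_000_000, 10_000), ("B", 500_000, 1_000), ("C", 2_000_000, 100_000)]

# Normalize by area so that large regions do not dominate the map
densities = [population / area for _, population, area in regions]
shade_index = classify(densities)
for (name, _, _), d, shade in zip(regions, densities, shade_index):
    print(f"{name}: {d:.0f} people/km^2 -> shade {shade} of 0..6")
```

Region C has the largest population but the lowest density, so after normalization it receives the lightest shade, while the small, dense region B receives the darkest.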

**(3) Connection Maps**

In a connection map, each line represents a certain number of connections between two locations. The lines have the same thickness and value, showing the magnitude of connections. You can use different colors for lines to represent multiple categories or groups, or use a colormap to encode connection length.

Connection maps are used to visualize connections between locations, making it easier to understand relationships and patterns of connectivity.

<div style="text-align:center">
    <p style="font-weight: bold; font-size:12px; text-align:center">Connection Map</p>
    <img src="images/05/connection_map.png" width="200">
    <p style="font-style:italic; font-size:12px; text-align:center">Image Source: <a href="https://datavizproject.com/data-type/connection-map/" style="text-decoration:none; color:inherit;">DVP</a></p>
</div>


Design Practices for Connection Maps:

* Avoid showing too many connections to keep the map clear and manageable for analysis.
* Choose a line thickness and value that allows lines to blend in dense areas, providing a clear impression of the spatial distribution.

To wrap up this section, here are some key design tips for effective data visualization:

1. *Use Colors to Differentiate*: Employ colors to distinguish between different variables or subjects as they are easier to see than symbols.

2. *Enhance 2D Plots with Multiple Variables*: To show extra variables on a 2D plot, vary color, shape, and size.

3. *Keep It Simple*: Avoid overloading your visualization with too much information. Keep it straightforward for better clarity and understanding.

## 3. Interactive Data Visualization

In today's data-driven landscape, presenting complex data effectively is essential for making even the most sophisticated statistical analyses understandable. That's where interactive data visualization becomes invaluable. It offers an intuitive and dynamic approach to exploring, analyzing, and interpreting data, allowing us to uncover meaningful insights. Through interactive visualizations, we can make informed decisions, drive innovation, and gain a deeper understanding of the data's implications.

In this section, we will learn how to create interactive plots using the Bokeh library. By the end, you'll know the entire process of building applications with Bokeh.

### 3.1 Introduction of Bokeh  <img src='images/05/bokeh.png' align='right' width=90>

Bokeh is a modern, interactive visualization library designed for web browsers. Plots created with Bokeh are based on JavaScript widgets. Bokeh makes it easy to create visually appealing plots and graphs with minimal styling. Additionally, it allows us to build interactive dashboards based on large static datasets or even streaming data. Some key features of Bokeh are:

* *Simple and Complex Visualizations*: Bokeh caters to users of all skill levels, offering easy and customizable visualizations.
* *Excellent Animated Visualizations*: Bokeh handles large datasets, making it great for animated visuals and data analysis.
* *Inter-Visualization Interactivity*: Bokeh allows the creation of impactful dashboards with interconnected plots.
* *Multiple Language Support*: Bokeh supports Python and JavaScript, making it accessible to different users.
* *Various Interactivity Options*: Bokeh provides various ways to add interactivity to visualizations, like zooming and filtering.
* *Beautiful Chart Styling*: Bokeh creates attractive plots without extensive custom styling. Rendering happens in the browser through its BokehJS library, and the Bokeh server is built on Tornado.

If you want to learn more about Bokeh, please visit its [website](https://bokeh.org/).

The following are some key concepts related to Bokeh:

* *Application*: A Bokeh application refers to a rendered Bokeh document that operates within a web browser.
* *Glyphs*: The fundamental elements of Bokeh, glyphs comprise lines, circles, rectangles, and other shapes visible on a Bokeh plot.
* *Server*: Utilized for sharing and publishing interactive plots and apps to a specific audience, the Bokeh server is an essential component.
* *Widgets*: Bokeh widgets encompass sliders, drop-down menus, and other small tools that can be embedded into your plot to introduce interactivity.

For more detail, see the official [documentation](https://docs.bokeh.org/en/latest/docs/user_guide/intro.html).

### 3.2 Basic Plotting

**(1) Simple Plot**

In Bokeh, graphs are built gradually, one step at a time. We start with a figure, which acts as the canvas, and then we add elements called `glyphs` to it. `Glyphs` are similar to geoms in `ggplot` and act like separate layers on the graph. Depending on what we want, `glyphs` can take different forms like circles, lines, bars, and more.

In [2]:
# Import libraries
from bokeh.plotting import figure
from bokeh.io import show, output_notebook

# Create a blank figure with labels
# (Bokeh 3.x takes width/height; the older plot_width/plot_height names were removed)
p = figure(width = 500, height = 500,
           title = 'Example Glyphs',
           x_axis_label = 'X', y_axis_label = 'Y')

# Example data
squares_x = [1, 3, 4, 5, 8]
squares_y = [8, 7, 3, 1, 10]
circles_x = [9, 12, 4, 3, 15]
circles_y = [8, 4, 11, 6, 10]

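# Add squares glyph (the square() helper is deprecated in recent Bokeh releases,
# so we use scatter() with a marker instead)
p.scatter(squares_x, squares_y, marker = 'square', size = 12, color = 'navy', alpha = 0.6)
# Add circle glyph
p.scatter(circles_x, circles_y, marker = 'circle', size = 12, color = 'red')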

# Set to output the plot in the notebook
output_notebook()
# Show the plot
show(p)

In the code above, we make a simple chart with squares and circles. First, we create a plot using the `figure` function, and then we add the shapes (glyphs) to the plot by calling glyph methods with our data. Finally, we show the plot. Because we are working in a Jupyter Notebook, the `output_notebook` call makes the plot appear directly below the code.

As you can see, Bokeh adds a toolbar on the right side of the plot by default. Its tools include panning, zooming, selection, and the ability to save the plot. These tools are customizable and prove useful when we need to explore our data.
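As a small illustration of that customization, the sketch below passes an explicit `tools` string to `figure` so that only a chosen subset of tools appears in the toolbar. The particular selection of tools is an arbitrary choice for illustration, not something the original exercise prescribes.

```python
from bokeh.plotting import figure

# Keep only a few tools in the toolbar (an illustrative selection)
p = figure(title='Example Glyphs',
           x_axis_label='X', y_axis_label='Y',
           tools='pan,wheel_zoom,box_zoom,reset,save')

# Add a single glyph so there is something to explore
p.line([1, 3, 4, 5, 8], [8, 7, 3, 1, 10], line_width=2)

# List the tools that ended up on the toolbar
tool_names = [type(tool).__name__ for tool in p.toolbar.tools]
print(tool_names)
```

Calling `show(p)` after `output_notebook()` renders the plot as before, now with the reduced toolbar.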

**(2) Plot with Dataset**

In this section, we are going to plot barley yield using the crop yield dataset, which you can download from the [Our World in Data](https://ourworldindata.org/crop-yields#explore-data-on-crop-yields) website.

In [3]:
import pandas as pd

# Read the data from a csv into a dataframe
crop_yield = pd.read_csv('data/crop_yields.csv', index_col=0)

crop_yield.shape

(8804, 206)

In [4]:
crop_yield.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8804 entries, Afghanistan to Zimbabwe
Columns: 206 entries, year to wheat_yield_gap
dtypes: float64(205), int64(1)
memory usage: 13.9+ MB


Verify that our data has been successfully loaded by calling `head` on our DataFrame.

In [5]:
crop_yield.head()

| country | year | abaca__manila_hemp__raw_yield | agave_fibres__raw__n_e_c_yield | almond_yield | apples_yield | apricots_yield | areca_nuts_yield | artichokes_yield | asparagus_yield | avocados_yield | ... | potato_yield_gap | rapeseed_yield_gap | rice_yield_gap | rye_yield_gap | sorghum_yield_gap | soybean_yield_gap | sugarbeet_yield_gap | sugarcane_yield_gap | sunflower_yield_gap | wheat_yield_gap |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Afghanistan | 1961 | NaN | NaN | NaN | 6.8018 | 6.639 | NaN | NaN | NaN | NaN | ... | 37.143303 | NaN | 7.841 | NaN | NaN | NaN | 66.0224 | NaN | 0.4042 | 2.978 |
| Afghanistan | 1962 | NaN | NaN | NaN | 6.8018 | 6.639 | NaN | NaN | NaN | NaN | ... | 38.143303 | NaN | 7.841 | NaN | NaN | NaN | 61.216705 | NaN | 0.3078 | 3.0265 |
| Afghanistan | 1963 | NaN | NaN | NaN | 6.8018 | 6.639 | NaN | NaN | NaN | NaN | ... | 37.6767 | NaN | 7.841 | NaN | NaN | NaN | 62.391903 | NaN | 0.3078 | 3.1683 |
| Afghanistan | 1964 | NaN | NaN | NaN | 7.8298 | 7.6863 | NaN | NaN | NaN | NaN | ... | 37.210003 | NaN | 7.6327 | NaN | NaN | NaN | 69.1529 | NaN | 0.3078 | 3.049 |
| Afghanistan | 1965 | NaN | NaN | NaN | 8.2258 | 8.0819 | NaN | NaN | NaN | NaN | ... | 37.010002 | NaN | 7.6327 | NaN | NaN | NaN | 64.01 | NaN | 0.2596 | 3.0277 |


In this example, we just want to plot the barley yield in the United States.

In [6]:
import numpy as np

# Drop the missing (NaN) values from the 'barley_yield' column
crop_yield = crop_yield[~np.isnan(crop_yield['barley_yield'])]

# Create a dictionary with the data
data = {
    'country': crop_yield.index.tolist(),
    'year': crop_yield['year'].tolist(),
    'barley_yield': crop_yield['barley_yield'].tolist()
}

# Create the DataFrame from the dictionary
barley_yield = pd.DataFrame(data)

# Filter the 'barley_yield' DataFrame to get data for the United States
us_barley_yield = barley_yield[barley_yield['country'] == 'United States']
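The same filtering can be sketched more compactly with pandas' own methods: `dropna` removes rows with a missing value in a given column, and `reset_index` turns the country index into an ordinary column. The tiny DataFrame below is made-up illustrative data, not values from the crop yield dataset, and the names `toy`, `toy_barley`, and `country_b` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for crop_yield: country as the index, two data columns
toy = pd.DataFrame(
    {'year': [1961, 1962, 1961, 1962],
     'barley_yield': [1.5, np.nan, 2.0, 2.1]},
    index=pd.Index(['Country A', 'Country A', 'Country B', 'Country B'],
                   name='country'))

# Drop rows without a barley yield, then keep only the columns of interest
toy_barley = (toy.dropna(subset=['barley_yield'])
                 .reset_index()[['country', 'year', 'barley_yield']])

# Select a single country, as done for the United States above
country_b = toy_barley[toy_barley['country'] == 'Country B']
print(country_b)
```

This yields the same three-column layout as the dictionary-based approach, without building the intermediate dictionary.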

In [7]:
# Summary stats for the column of interest
barley_yield['barley_yield'].describe()

count    4860.000000
mean        2.444108
std         1.596174
min         0.093700
25%         1.111100
50%         2.141850
75%         3.341600
max        10.380899
Name: barley_yield, dtype: float64

**Bar Chart**

We'll use Bokeh's `vbar()` glyph method to make a bar chart showing the barley yield in the United States.

In [8]:
from bokeh.plotting import figure
from bokeh.io import show, output_notebook

# Plot the barley yield in the United States
# Create a Bokeh figure
output_notebook()  # To display the plot in the notebook (Jupyter)

p = figure(plot_height=400, plot_width=600, title="Barley Yield in the United States")

# Create the bar plot
p.vbar(x=us_barley_yield['year'], top=us_barley_yield['barley_yield'],
       width=0.7, color='red', legend_label='United States')

# Set the labels and titles
p.xaxis.axis_label = "Year"
p.yaxis.axis_label = "Barley Yield"
p.title.align = "center"
p.legend.location = "top_left"

# Show the plot
show(p)

**Line Chart**

Next, we'll use Bokeh's `line()` glyph method to make a line chart showing the barley yield in the United States.

In [9]:
# Create a Bokeh figure
output_notebook()  # To display the plot in the notebook (Jupyter)

p = figure(plot_height=400, plot_width=600, title="Barley Yield in the United States")

# Create the line plot
p.line(x=us_barley_yield['year'], y=us_barley_yield['barley_yield'], 
       line_width=2, color='blue', legend_label='United States')

# Set the labels and titles
p.xaxis.axis_label = "Year"
p.yaxis.axis_label = "Barley Yield"
p.title.align = "center"
p.legend.location = "top_left"

# Show the plot
show(p)

### 3.3 Adding Interactivity

Bokeh offers two types of interactions: passive and active. Passive interactions, also known as inspectors, allow users to closely examine a plot without changing the displayed information. On the other hand, active interactions directly modify the data shown in the plot. These interactions encompass actions such as selecting a subset of the data or adjusting the degree of a polynomial regression fit.

**(1) Passive Interactions**

In [10]:
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.models import HoverTool

# Create a Bokeh figure
output_notebook()  # To display the plot in the notebook (Jupyter)

p = figure(plot_height=400, plot_width=600, title="Barley Yield in the United States")

# Create the bar plot
p.vbar(x=us_barley_yield['year'], top=us_barley_yield['barley_yield'], width=0.7, color='red',
       hover_fill_alpha = 1.0, hover_fill_color = 'navy')

# Set the labels and titles
p.xaxis.axis_label = "Year"
p.yaxis.axis_label = "Barley Yield"
p.title.align = "center"

# Add a hover tool
hover = HoverTool()
hover.tooltips = [('Year', '@x'), ('Barley Yield', '@top{0.00}')]
p.add_tools(hover)

# Show the plot
show(p)

In the Bokeh style, we build our chart by adding elements to the original figure. In the code above, we use the `p.vbar` glyph method to create a bar plot, passing the additional parameters `hover_fill_alpha` and `hover_fill_color`, which change the appearance of a bar when the mouse hovers over it.

We also add interactivity to the plot with a `HoverTool` instance. The `HoverTool` lets us define tooltips that display specific information when hovering over data points. We pass a list of tooltips as Python tuples, where the first element is the label for the data and the second element references the data we want to show. In this case, `@x` references the x-position (year) and `@top{0.00}` references the top of each bar (barley yield). The `{0.00}` format specifier displays the barley yield with two decimal places.

The final plot is presented below:

<div style="text-align:center">
    <img src="images/05/example.png" width="600">
</div>
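Tooltips can also reference named columns when the data is passed through a `ColumnDataSource`. The following is a minimal sketch with made-up values; the column names `year` and `barley_yield` mirror our dataset, but the numbers are purely illustrative.

```python
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure

# Made-up values for illustration only
source = ColumnDataSource(data={
    'year': [1961, 1962, 1963],
    'barley_yield': [1.5, 1.75, 2.125],
})

p = figure(title='Hover Tooltip Sketch')

# Column names replace the raw sequences; Bokeh looks them up in `source`
p.vbar(x='year', top='barley_yield', source=source, width=0.7)

# '@column' refers to a column of the source; '{0.00}' shows two decimals
hover = HoverTool(tooltips=[('Year', '@year'),
                            ('Barley Yield', '@barley_yield{0.00}')])
p.add_tools(hover)
```

With named columns, the tooltip fields read `@year` and `@barley_yield` rather than the generic `@x` and `@top`.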


**(2) Active Interactions**

In Bokeh, various types of [active interactions](https://docs.bokeh.org/en/latest/docs/user_guide/interaction.html) exist for users to interact with plots. In this section, we'll concentrate on a particular type called "widgets." These are elements you can click on to control various aspects of the plot.

We will explore several widgets and learn how to use them to build interactive plots, starting with a basic example of each widget. We also cover the different update trigger options for the widgets. The table below provides an overview of the widgets:

| <sub>Value</sub><br> | <sub>Widget</sub><br> | <sub>Example</sub><br> 
| :-----------: | :------------: | :-----------: |
| <sub>Boolean</sub> | <sub>Checkbox</sub> | <sub>False</sub> 
| <sub>String</sub> | <sub>Text</sub> | <sub>'Input Text'</sub>
| <sub>Int value, Int range</sub> | <sub>IntSlider</sub> | <sub>5, (0, 10), (4, 16, 2)</sub> 
| <sub>Float value, Float range</sub> | <sub>FloatSlider</sub> | <sub>1.0, (0.0, 10.0), (0.1, 2.1, 0.5)</sub> 
| <sub>List or Dict</sub> | <sub>Dropdown</sub> | <sub>['Option1', 'Option2'], {'one': 1, 'two': 2}</sub> 

Below is a code example showing how to add basic widgets in Python using the `ipywidgets` library.

In [11]:
# Importing the widgets
from ipywidgets import interact, interact_manual

In [12]:
# Create a checkbox 
@interact(Value=False)
def checkbox(Value=False):
    print(Value)

interactive(children=(Checkbox(value=False, description='Value'), Output()), _dom_classes=('widget-interact',)…

In [13]:
# Create an input text
@interact(Value='Input Text')
def text_input(Value):
    print(Value)

interactive(children=(Text(value='Input Text', description='Value'), Output()), _dom_classes=('widget-interact…

In [14]:
# Create a float slider with a manual update trigger
@interact_manual(Value=(0.0, 11.0))
def slider(Value=0.0):
    print(Value)

interactive(children=(FloatSlider(value=0.0, description='Value', max=11.0), Button(description='Run Interact'…

In [15]:
# Multiple widgets with default layout
options=['Option1', 'Option2', 'Option3', 'Option4']
@interact(Select=options, Display=False)
def uif(Select, Display):
    print(Select, Display)

interactive(children=(Dropdown(description='Select', options=('Option1', 'Option2', 'Option3', 'Option4'), val…

Now, let's explore an example using our own dataset. In this exercise, we'll use widgets to create a simple plot displaying the barley yield for different countries in specific years. The countries will be selectable from a menu, and the range of years will be adjustable with sliders.

**To begin, we need to prepare our dataset and select the specific column that we require.**

In [16]:
import pandas as pd
import numpy as np

# Read the data from a csv into a dataframe
crop_yield = pd.read_csv('data/crop_yields.csv', index_col=0)

# Drop the missing (NaN) values from the 'barley_yield' column
crop_yield = crop_yield[~np.isnan(crop_yield['barley_yield'])]

# Create a dictionary with the data
data = {
    'country': crop_yield.index.tolist(),
    'year': crop_yield['year'].tolist(),
    'barley_yield': crop_yield['barley_yield'].tolist()
}

# Create the DataFrame from the dictionary
barley_yield = pd.DataFrame(data)

barley_yield.head()

| | country | year | barley_yield |
| --- | --- | --- | --- |
| 0 | Afghanistan | 1961 | 1.08 |
| 1 | Afghanistan | 1962 | 1.08 |
| 2 | Afghanistan | 1963 | 1.08 |
| 3 | Afghanistan | 1964 | 1.0857 |
| 4 | Afghanistan | 1965 | 1.0857 |


**Next, let's add a widget to the plot and see different countries' barley yield.** You can use the `ipywidgets` library to create an interactive dropdown widget. This widget will allow you to select different countries, and the plot will update accordingly to show the barley yield for the selected country.

In [17]:
from bokeh.io import push_notebook, show
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, HoverTool
import ipywidgets as widgets
from ipywidgets import interact

# This cell uses the 'barley_yield' DataFrame (columns 'country', 'year',
# and 'barley_yield') created in the previous cell

# Function to update the plot based on the selected country
def update_plot(country):
    # Get the data for the selected country
    country_data = barley_yield[barley_yield['country'] == country]

    # Create a Bokeh figure
    p = figure(plot_height=400, plot_width=700, title=f"Barley Yield in {country}")

    # Create the line plot
    p.line(x='year', y='barley_yield', source=ColumnDataSource(country_data), line_width=2)

    # Set the labels and titles
    p.xaxis.axis_label = "Year"
    p.yaxis.axis_label = "Barley Yield"
    p.title.align = "center"

    # Add hover tool
    hover = HoverTool()
    hover.tooltips = [("Year", "@year"), ("Barley Yield", "@barley_yield")]
    p.add_tools(hover)

    # Show the plot
    show(p)

# Create the dropdown widget for selecting the country
country_dropdown = widgets.Dropdown(
    options=list(barley_yield['country'].unique()),
    description='Select Country:',
)

# Call the update_plot function whenever the dropdown selection changes
widgets.interactive(update_plot, country=country_dropdown)

interactive(children=(Dropdown(description='Select Country:', options=('Afghanistan', 'Albania', 'Algeria', 'A…

In the above code, the `update_plot` function takes the selected country as an argument and plots the barley yield data for that country using the `bokeh` library. The `widgets.interactive` function connects the dropdown widget to the update_plot function, so whenever you select a different country from the dropdown, the plot will update to show the barley yield for that country.

**But how can we enhance the visualization to display and compare the barley yield of multiple countries?**

To allow for selecting multiple countries, you can use the `widgets.SelectMultiple` widget from `ipywidgets`. With this widget, you can choose multiple countries from a list, enabling you to display and compare the barley yield of multiple countries simultaneously. Here's how you can make the necessary modifications to the code to achieve this capability.

In [18]:
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.models import HoverTool, ColumnDataSource
from bokeh.palettes import Category20
import ipywidgets as widgets

# This cell uses the 'barley_yield' DataFrame (columns 'country', 'year',
# and 'barley_yield') created earlier

# Function to update the plot based on the selected countries
def update_plot(countries):
    # Convert the tuple of selected countries to a regular Python list
    countries_list = list(countries)

    # Filter the DataFrame to include only the selected countries
    country_data = barley_yield[barley_yield['country'].isin(countries_list)]

    # Create a Bokeh figure
    p = figure(plot_height=400, plot_width=700, title="Barley Yield in Selected Countries")

    # Choose a Category20 palette (sizes 3 to 20) and take one color per selected country
    num_colors_needed = len(countries_list)
    colors = Category20[max(3, num_colors_needed)][:num_colors_needed]

    # List to store individual ColumnDataSources for each country
    data_sources = []

    # Create a line plot for each selected country
    for country, color in zip(countries_list, colors):
        country_data_single = country_data[country_data['country'] == country]
        source = ColumnDataSource(data=country_data_single)
        p.line(x='year', y='barley_yield', source=source, line_width=2, color=color, legend_label=country)
        data_sources.append(source)

    # Set the labels and titles
    p.xaxis.axis_label = "Year"
    p.yaxis.axis_label = "Barley Yield"
    p.title.align = "center"

    # Add hover tool
    hover = HoverTool()
    hover.tooltips = [("Year", "@year"), ("Barley Yield", "@barley_yield")]
    p.add_tools(hover)

    # Show the plot
    show(p)

# Create the dropdown widget for selecting the countries
country_dropdown = widgets.SelectMultiple(
    options=list(barley_yield['country'].unique()),
    description='Select Countries:'
)

# Call the update_plot function whenever the dropdown selection changes
widgets.interactive(update_plot, countries=country_dropdown)

interactive(children=(SelectMultiple(description='Select Countries:', options=('Afghanistan', 'Albania', 'Alge…

Once you select multiple countries from the list, each country's line is drawn in its own color, and the legend shows which color belongs to which country, making the lines easier to distinguish in the plot.

In the code above, Bokeh's `Category20` palette supplies a distinct color for each selected country, and the `zip` function pairs each country with its color. Note that we have 97 countries while `Category20` offers at most 20 distinct colors. The code does not cycle through the palette, so selecting more than 20 countries at once raises a `KeyError`.
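One way to supply a color for any number of countries is to cycle through a fixed palette with `itertools.cycle`. The sketch below is only an illustration: the helper name `assign_colors`, the three-color stand-in palette, and the country names are assumptions made for this example. In the notebook, `Category20[20]` could be passed as the palette instead.

```python
from itertools import cycle

def assign_colors(names, palette):
    """Pair each name with a color, reusing colors once the palette runs out."""
    return dict(zip(names, cycle(palette)))

# Stand-in palette of three hex colors; in the notebook, Category20[20]
# from bokeh.palettes could be passed instead
toy_palette = ('#1f77b4', '#ff7f0e', '#2ca02c')
toy_countries = ['Country A', 'Country B', 'Country C', 'Country D', 'Country E']

color_map = assign_colors(toy_countries, toy_palette)
for country, color in color_map.items():
    print(country, color)
```

Colors repeat once the palette is exhausted, so beyond that point lines are no longer uniquely colored and the legend is needed to tell them apart.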

**Last, we want to add more controls to the plot, such as the range of years to display.** You can include additional widgets like sliders to let the user set those values interactively. Here's the code that includes sliders for controlling the range of years to display.

In [19]:
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.models import HoverTool, ColumnDataSource
import ipywidgets as widgets
from bokeh.palettes import Category20

# This cell uses the 'barley_yield' DataFrame (columns 'country', 'year',
# and 'barley_yield') created earlier

# Function to update the plot based on the selected countries and range of years
def update_plot(countries, start_year, end_year):
    # Convert the tuple of selected countries to a regular Python list
    countries_list = list(countries)

    # Filter the DataFrame to include only the selected countries and range of years
    country_data = barley_yield[(barley_yield['country'].isin(countries_list)) & 
                                (barley_yield['year'] >= start_year) & 
                                (barley_yield['year'] <= end_year)]

    # Create a Bokeh figure
    p = figure(plot_height=400, plot_width=700, title="Barley Yield in Selected Countries")

    # Choose a Category20 palette (sizes 3 to 20) and take one color per selected country
    num_colors_needed = len(countries_list)
    colors = Category20[max(3, num_colors_needed)][:num_colors_needed]

    # List to store individual ColumnDataSources for each country
    data_sources = []

    # Create a line plot for each selected country
    for country, color in zip(countries_list, colors):
        country_data_single = country_data[country_data['country'] == country]
        source = ColumnDataSource(data=country_data_single)
        p.line(x='year', y='barley_yield', source=source, line_width=2, color=color, legend_label=country)
        data_sources.append(source)

    # Set the labels and titles
    p.xaxis.axis_label = "Year"
    p.yaxis.axis_label = "Barley Yield"
    p.title.align = "center"

    # Add hover tool
    hover = HoverTool()
    hover.tooltips = [("Year", "@year"), ("Barley Yield", "@barley_yield")]
    p.add_tools(hover)

    # Show the plot
    show(p)

# Create the dropdown widget for selecting the countries
country_dropdown = widgets.SelectMultiple(
    options=list(barley_yield['country'].unique()),
    description='Select Countries:'
)

# Create sliders for selecting the range of years
start_year_slider = widgets.IntSlider(
    value=min(barley_yield['year']),
    min=min(barley_yield['year']),
    max=max(barley_yield['year']),
    description='Start Year:',
    continuous_update=False
)

end_year_slider = widgets.IntSlider(
    value=max(barley_yield['year']),
    min=min(barley_yield['year']),
    max=max(barley_yield['year']),
    description='End Year:',
    continuous_update=False
)

# Arrange the widgets on the left side using VBox layout
left_side_widgets = widgets.VBox([country_dropdown, start_year_slider, end_year_slider])

# Call the update_plot function whenever the dropdown selection or sliders change
widgets.interactive(update_plot, countries=country_dropdown, start_year=start_year_slider, end_year=end_year_slider)

interactive(children=(SelectMultiple(description='Select Countries:', options=('Afghanistan', 'Albania', 'Alge…

By creating the `start_year_slider` and `end_year_slider`, you can now control the range of years to display in the plot in addition to selecting the countries to compare. The plot will update automatically whenever you change the selections using the dropdown and sliders.
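Because the two sliders are independent, the start year can end up later than the end year, in which case the filter selects no rows. One possible guard, sketched here as a plain helper with the hypothetical name `ordered_year_range`, is to sort the two values before filtering. The years used below are made up for illustration.

```python
def ordered_year_range(start_year, end_year):
    """Return the two years as (earlier, later), whatever order they arrive in."""
    return min(start_year, end_year), max(start_year, end_year)

# Example: the sliders were dragged past each other
start_year, end_year = ordered_year_range(2005, 1980)
print(start_year, end_year)

# Years from a made-up list that fall inside the corrected range
toy_years = [1975, 1980, 1990, 2005, 2010]
selected = [year for year in toy_years if start_year <= year <= end_year]
print(selected)
```

Inside `update_plot`, such a helper could be called on the two slider values before the DataFrame is filtered.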

### 3.4 Dashboard with Bokeh

In this section, we will not cover building the dashboard with Bokeh. However, if you are interested in creating your own dashboard, you can refer to the article titled "[Data Visualization with Bokeh in Python, Part III: Making a Complete Dashboard](https://medium.com/towards-data-science/data-visualization-with-bokeh-in-python-part-iii-a-complete-dashboard-dc6a86aa6e23)." This article provides a comprehensive guide on how to build a complete dashboard using Bokeh. 

## 4. Conclusion

In conclusion, this notebook has provided a solid foundation for understanding the fundamental concepts of data visualization. We have learned about the significance of data visualization and its role in presenting data in an easily understandable and interpretable manner.

Throughout this notebook, we explored various types of plots commonly used in data visualization and gained insights into creating interactive and engaging visualizations. Understanding how to effectively visualize data is crucial in conveying complex information and patterns to the audience.

As you continue your data visualization journey, you are now equipped with knowledge of tools including matplotlib, seaborn, geoplotlib, and Bokeh. These tools will help you create compelling visual representations of data, enabling you to make informed decisions and communicate insights effectively.