# Exploring Data Visualization with Altair

**Objective**: By the end of this workshop, you'll be able to create a variety of visualizations using Altair, understand the principles of Altair’s declarative syntax, and explore data interactively.

**Duration**: Approx. 3 hours

**Target Audience**: This workshop is designed for people who have a basic understanding of Python, are already familiar with pandas, and are interested in exploring data visualization.


## 1. What is Altair?

Altair adopts a declarative approach to data visualization. Instead of providing a series of procedural instructions to create a visualization, you describe the desired outcome. This makes it easier to see what the code is intended to do, and often results in more concise code.

**Importance**: Altair’s declarative nature makes it a powerful tool for rapidly exploring visualizations, making it an excellent choice for exploratory data analysis (EDA). It’s also built on a solid foundation of visualization theory: its charts compile to Vega-Lite, a grammar of interactive graphics, which helps the visualizations it creates communicate insights effectively.

**Use Cases**: Altair is widely used in the data science community for tasks such as EDA, creating interactive dashboards with tools like [**Streamlit**](https://streamlit.io/), and producing publication-quality visualizations.

## 2. Setup

### 2.1 Installing Altair

To install Altair, you can use `pip` or `conda`:

In [53]:
# option 1: install altair and seaborn using pip
# !pip install altair seaborn

# option 2: install altair and seaborn using conda
# !conda install -c conda-forge altair seaborn

### 2.2 Importing Libraries

Import Altair, together with seaborn (used here only to load the dataset) and pandas:

In [54]:
import altair as alt # to create the charts
import seaborn as sns # to load the data
import pandas as pd # to work with the data

### 2.3 Verify Installation

Verify that Altair is installed correctly by importing it and checking the version number:

In [55]:
# Verify Altair installation
print("Altair version: ", alt.__version__)
print("Seaborn version: ", sns.__version__)
print("Pandas version: ", pd.__version__)

Altair version:  5.1.2
Seaborn version:  0.13.0
Pandas version:  2.1.1


### 2.4 Dataset Loading

Load the dataset into a pandas DataFrame:

In [80]:
penguins = sns.load_dataset("penguins")
penguins.head()

|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
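
The fourth row in the preview above has no measurements at all. Here is a small sketch of how you might count and drop such rows with pandas before plotting. It uses a made-up three-row frame (`sample` is a hypothetical stand-in, not the real dataset), and whether dropping rows is appropriate depends on your analysis.

```python
import pandas as pd

# Hypothetical miniature stand-in for the penguins table
sample = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie"],
    "bill_length_mm": [39.1, None, 36.7],
    "body_mass_g": [3750.0, None, 3450.0],
})

# Count missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Keep only the rows in which every column has a value
complete_rows = sample.dropna()
print(len(complete_rows))
```

The same two calls work on the full `penguins` DataFrame.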


## 3. Basic Concepts

As mentioned above, Altair is built around a declarative visualization design: you declare links between data columns and visual properties, and Altair takes care of the rest.

### 3.1 Declarative vs. Procedural

In programming, there are two main approaches: procedural and declarative. The procedural approach is like being a chef who follows a recipe step by step, where you need to detail every action to get to the final dish. This method requires you to explicitly outline each step of the process, making it clear but often lengthy.

On the flip side, the declarative approach, which Altair adopts, is like being a diner at a restaurant. You just state what dish you want, and the kitchen staff takes care of the rest. In Altair, you specify what you want the visualization to look like, and the library figures out how to create it. For example, you tell Altair you want a scatter plot of height against weight, and Altair does the rest.

This declarative approach makes creating visualizations with Altair straightforward and predictable. You don't have to worry about the how, just the what. And this often results in less code, making your work more readable and easier to manage.

So, with Altair, you get to be the diner who enjoys a gourmet meal without having to sweat it out in the kitchen!

In [57]:
import altair as alt
import pandas as pd

# Create a simple dataset
data = pd.DataFrame({
    'x': ['A', 'B', 'C', 'D', 'E'],
    'y': [3, 5, 2, 8, 7]
})

# Create a bar chart
bar_chart = (
    alt # use altair
    .Chart(data) # use the data
    .mark_bar() # create a bar chart
    .encode( # map the data to the chart
        x='x', # map the 'x' column to the x-axis
        y='y' # map the 'y' column to the y-axis
    )
)

bar_chart

### 3.2 Some Basic Chart Types

In [58]:
# scatter plot
# great for showing the relationship between two variables

# create dataframe
df = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': [3, 5, 2, 8, 7]
})

# create scatter plot
scatter_plot = (
    alt
    .Chart(df)
    .mark_circle(
        size=100 # set the size of the circles
        )
    .encode(
        x='x',
        y='y'
    )
)

scatter_plot

In [59]:
# line chart
# great for time series data or data that is ordered by x-axis values

# create dataframe
df = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': [3, 5, 2, 8, 7]
})

# create line chart
line_chart = (
    alt
    .Chart(df)
    .mark_line(
        color='red' # change the color of the line
        )
    .encode(
        x='x',
        y='y'
    )
)

line_chart

You can check out the [Altair Gallery](https://altair-viz.github.io/gallery/index.html) for a full list of supported chart types. Here are some of the most common ones:

1. Bar Chart: Ideal for comparing categorical data.
2. Line Chart: Suitable for displaying trends over a continuous range or time period.
3. Scatter Plot: Perfect for showing relationships between two numerical variables.
4. Histogram: Useful for displaying frequency distributions.
5. Box Plot: Provides a summary of the data distribution through quartiles.
6. Area Chart: Useful for showing cumulative totals or summations.
7. Heatmap: Ideal for displaying relationships between two categorical variables.
8. Violin Plot: Combines aspects of box plots and histograms to provide a summary of data distributions.
9. Pie Chart: Represents categorical data as a proportion of a whole.
10. Ridgeline Plot: Useful for comparing distributions among different categories.

### 3.3 Combining Charts

In Altair, you can combine multiple charts into a single visualization using the `+` operator. For example, you can combine a scatter plot with a line chart to show the relationship between two variables and a trend line.

In [60]:
# Layer the line chart and the scatter plot
combined_chart = (line_chart + scatter_plot)
combined_chart

## 4. Theory and Examples

### 4.1 Theory: Encoding Channels in Altair

In Altair, encoding channels are the bridge between data variables and visual properties of a chart. They define how data columns map to the properties of visual marks. For example, in a scatter plot, two encoding channels might map data variables to the x and y positions of points, while another encoding channel might map a data variable to the color or size of points.

In [61]:
# Import necessary libraries
import altair as alt
import pandas as pd

# Load or create a dataset
data = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': [5, 4, 3, 2, 1],
    'category': ['A', 'B', 'A', 'B', 'A']
})

# Example 1: Basic X and Y Encoding
basic_chart = alt.Chart(data).mark_point().encode(
    x='x',
    y='y'
)
basic_chart.display()

In Example 1, we see a simple scatter plot where the `x` and `y` encoding channels map data variables to the x and y positions of points.

In [62]:
# Example 2: Color Encoding
color_chart = alt.Chart(data).mark_point().encode(
    x='x',
    y='y',
    color='category'
)
color_chart.display()

In Example 2, we introduce a `color` encoding channel to color points by their `category` values, showcasing how we can convey additional information through color.

In [63]:
# Example 3: Size Encoding
size_chart = alt.Chart(data).mark_point().encode(
    x='x',
    y='y',
    size='y'
)
size_chart.display()

In Example 3, the `size` encoding channel maps the `y` value to the size of the points, demonstrating how size can be used to represent a data variable visually.

## 5. Challenge Yourself

Now use the penguins dataset to challenge yourself and create some visualizations.

In [64]:
import seaborn as sns
import altair as alt

# Load the penguins dataset
penguins = sns.load_dataset('penguins')

# Explore the dataset
penguins.head()

|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |


### 5.2 Guided Exercises

### Exercise 1: Basic Bar Chart

Create a basic bar chart showing the average flipper length for each species of penguin.

To accomplish this, you'll need to look into aggregating data in Altair. You can find more information about aggregation in the [Altair documentation](https://altair-viz.github.io/user_guide/encodings/index.html#encoding-aggregates).

In [74]:
# Your code here

#### Hint:

Use the `mark_bar()` method to create a bar chart, and an aggregate such as `mean(...)` in the `y` encoding.

#### Solution:

In [70]:
# Create a basic bar chart showing the average flipper length for each species of penguin.

alt.Chart(penguins).mark_bar().encode(
    x='species',
    y='mean(flipper_length_mm)'
)

### Exercise 2: Scatter Plot 

Create a scatter plot comparing flipper length and body mass, colored by species.

In [73]:
# Your code here

#### Hint:

Use the `mark_point()` method to create a scatter plot, and the `color` encoding to distinguish species.

#### Solution:

In [67]:
alt.Chart(penguins).mark_point().encode(
    x='flipper_length_mm',
    y='body_mass_g',
    color='species'
)

### Exercise 3: Histogram

Create a histogram of body mass for each species.

In [72]:
# Your code here

#### Hint:

Use the `mark_bar()` method for creating a bar chart, and consider using the `bin=True` parameter in the `X` encoding to create a histogram. Don't forget to use the `color` encoding to differentiate between species.

#### Solution:

In [71]:
# Example Solution:
alt.Chart(penguins).mark_bar().encode(
    alt.X('body_mass_g', bin=True),
    y='count()',
    color='species'
)

### Exercise 4: Box Plot

Create a box plot of flipper length for each species.

In [None]:
# Your code here

#### Hint:

Use the `mark_boxplot()` method to create a box plot. You will need to specify the x and y encoding channels to represent the species and flipper length, respectively.

#### Solution:

In [75]:
# Example Solution:
alt.Chart(penguins).mark_boxplot().encode(
    x='species',
    y='flipper_length_mm'
)

## 6. Advanced Concepts

### 6.1 Interactivity

Altair provides a simple way to add interactivity to your visualizations. Calling the `interactive()` method on a chart binds its axes to mouse input, so you can pan by dragging and zoom by scrolling. Adding a `tooltip` encoding shows the listed columns when you hover over a point.

In [76]:
import altair as alt
import seaborn as sns

# Load the penguins dataset from seaborn
penguins = sns.load_dataset('penguins')

# Create an interactive scatter plot
interactive_scatter = alt.Chart(penguins).mark_point().encode(
    x='flipper_length_mm',
    y='bill_depth_mm',
    color='species',
    tooltip=['species', 'island']
).interactive()

interactive_scatter

### 6.2 Multi-panel Charts

Altair makes it easy to create multi-panel charts. You can use the `facet()` method to create a multi-panel chart, where each panel shows a subset of the data. For example, you can create a multi-panel chart showing the relationship between two variables for each species of penguin.

In [83]:
# Create a multi-panel scatter plot
multi_panel_scatter = alt.Chart(penguins).mark_point().encode(
    x='flipper_length_mm',
    y='bill_depth_mm',
    color='species',
).facet(
    column='sex',
    row='island'
)

multi_panel_scatter


## 7. Additional Resources

Official Documentation: Altair has well-maintained [official documentation](https://altair-viz.github.io/altair-tutorial/README.html), with a thorough explanation of Altair's features and numerous examples.