# Altair Notes
### Alex Knowlton
This notebook was compiled from notes I took while following along with Jake VanderPlas's excellent talk at PyCon 2018, which was given in Jupyter notebooks. It follows his implementations and assorted tips and tricks for learning the grammar of Altair.
The talk can be found here: [Jupyter Talks](https://www.youtube.com/watch?v=ms29ZPUKxbU&t=2654s). The notes and conclusions are mine, but much of the code is his. My research also included looking into mapping, which his talk did not cover, so all of the maps and the experiments are mine, in addition to tweaking some of the charts to be more understandable.  
  
For more examples of interaction and other graphs, see the [Altair example gallery](https://altair-viz.github.io/gallery/index.html).  


First, we need to import the libraries we will be using. The `vega_datasets` library has a bunch of sample datasets that are excellent for practicing visualizations. I will mainly use three, plus one more that I will import from the USGS website. The first is the cars dataset from vega, the second is the iris dataset that we used for machine learning, and the third is a spatial dataset with the world's countries in TopoJSON format (a sort of compacted GeoJSON).

In [None]:
import altair as alt
import pandas as pd
import geopandas as gpd
import numpy as np
from vega_datasets import data
cars = data.cars()
iris = data.iris()

In [None]:
cars.head()

In [None]:
iris.head()

## Creating Simple Charts
Core concepts: data (usually tabular), marks, and encodings.  
Encoding types: Quantitative, Nominal, Ordinal, and Temporal (dates and times).

Format for a typical chart - this is boilerplate that we will use over and over again, so it's a good idea to have a look at it now.
```python
chart = alt.Chart(data_name).mark_point().encode(
    x='column1:Type',
    y='column2:Type'
)
```
`alt.Chart(data)` creates a `Chart` object, which is like a fancy dictionary that we can plot. We also have to tell `altair` how to mark the data (point, circle, rectangle, etc.) using one of the `mark_*` methods, such as `mark_circle` or `mark_point`.

We then *encode* the data, which we will see here in a moment.
In this next example, we have marked it as a circle and then gone straight to a `dict` to see what we're dealing with.

In [None]:
alt.Chart(cars.head()).mark_circle().to_dict()

This chart produces something rather boring:

In [None]:
alt.Chart(cars).mark_circle()

Well, this kind of sucks. What's the point of that?
Well, `altair` actually did plot each row of the `cars` dataset - flat on top of each other, like pancakes. Now, we have to specify how we want them to be encoded. This means we have to specify how the data should be related to its position on the chart. We do this by using `encode()`

In [None]:
alt.Chart(cars).mark_point().encode(
    x='Horsepower:Q',
    y='Acceleration:Q',
    color='Origin:N',
    shape='Origin:N'
)

And now we have a scatterplot. Notice how we specify the `x` and `y` encodings - we first specify the column of the `DataFrame`, then the type of the data. In this case, we specified 'Q', for 'quantitative'. Another way to do this would be to use `alt.X('Horsepower', type='quantitative')`. The shorthand is a string in the format `'col_name:Type'`, where `Type` is the letter corresponding to the type of data in the column.  
  
To learn more about the options for encodings and the different options for `encode()`, please visit the [encoding documentation](https://altair-viz.github.io/user_guide/encoding.html#).   
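
To make the shorthand concrete, here is a minimal plain-Python sketch of how a string such as `'Horsepower:Q'` splits into a column name and a full type name. The helper `expand_shorthand` and the `TYPE_CODES` table are hypothetical names used only for illustration; Altair has its own parser, and this sketch ignores aggregates.

```python
# Hypothetical illustration only; this is not Altair's internal parser.
TYPE_CODES = {
    'Q': 'quantitative',
    'N': 'nominal',
    'O': 'ordinal',
    'T': 'temporal',
}


def expand_shorthand(shorthand):
    """Split 'column:Code' into a dict with the field name and full type name."""
    field, sep, code = shorthand.rpartition(':')
    if not sep:
        # No type code given, e.g. 'Origin'; the type is left unspecified here
        return {'field': shorthand}
    return {'field': field, 'type': TYPE_CODES[code]}


print(expand_shorthand('Horsepower:Q'))
print(expand_shorthand('Origin'))
```

When the type code is left off, as in the second call, Altair infers the type from the column's dtype in the `DataFrame`.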
  
We can also aggregate columns, using another shorthand syntax. For example, if we wanted to make a bar chart, we would encode one axis as categorical data and the other as a count of the rows in each category. We will also have to mark the data as a bar instead of a point.

In [None]:
alt.Chart(cars).mark_bar().encode(
    y='Origin',
    x='count()'
)

We can also specify color in the encoding to make it a stacked bar chart.

In [None]:
alt.Chart(cars).mark_bar().encode(
    y='Origin',
    x='count()',
    # try encoding it as 'nominal' instead!
    color=alt.Color('Cylinders', type='ordinal')
)

There are a **ton** of different aggregation functions, including `count`, which we've just seen, and `mean` and `median`, but also more arcane functions, like `q1` and `stdev`.
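
As a rough picture of what an aggregate encoding computes, the plain-Python sketch below groups a few made-up rows by category and computes a count, mean, median, and sample standard deviation per group. The rows are invented for illustration, and the standard library's `statistics` module stands in for the aggregation that the chart performs for us.

```python
from collections import defaultdict
import statistics

# Made-up rows standing in for a small slice of a table like `cars`
rows = [
    {'Origin': 'USA', 'Horsepower': 130},
    {'Origin': 'USA', 'Horsepower': 150},
    {'Origin': 'USA', 'Horsepower': 170},
    {'Origin': 'Japan', 'Horsepower': 90},
    {'Origin': 'Japan', 'Horsepower': 110},
]

# Group the values by category, as a nominal encoding on one axis would
groups = defaultdict(list)
for row in rows:
    groups[row['Origin']].append(row['Horsepower'])

# One summary record per group, which is what the aggregated axis plots
summary = {
    origin: {
        'count': len(values),
        'mean': statistics.mean(values),
        'median': statistics.median(values),
        'stdev': statistics.stdev(values),  # sample standard deviation
    }
    for origin, values in groups.items()
}

for origin, stats in summary.items():
    print(origin, stats)
```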

Another important topic is binning, which we can use to create histograms. To create a histogram, we do something almost identical to our bar chart, except we pass in a binned encoding for `x` and put the count on `y` (remember, histograms are for quantitative data, so we will use a different column).

In [None]:
alt.Chart(cars).mark_bar().encode(
    x=alt.X('Acceleration', bin=True),
    y='count()'
)
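
Conceptually, `bin=True` combined with `count()` assigns each value to a fixed-width interval and counts the rows that land in each one. The plain-Python sketch below does this by hand with made-up values. The bin width `step` is an assumption chosen for the example; Altair picks its own bin boundaries automatically.

```python
from collections import Counter
import math

# Made-up acceleration values for illustration
values = [8.5, 9.0, 11.5, 12.0, 12.5, 14.0, 15.5, 16.0, 19.5]
step = 2.0  # assumed bin width; Altair chooses this automatically

# Map each value to the lower edge of its bin, then count rows per bin
counts = Counter(math.floor(v / step) * step for v in values)

for lower in sorted(counts):
    print(f'[{lower}, {lower + step}): {counts[lower]}')
```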

## Creating multiple charts and transforms
To place `altair` plots side by side or stacked, we can use the `hconcat` and `vconcat` functions, or the `|` and `&` operators, which do the same thing. This is called 'concatenation'. We can also use `alt.layer` to put multiple plots on the same panel, with a shortcut of `+`.  
Additionally, we can use something called a `transform` to give us a line of best fit (a straight line by default). This next example has a scatterplot with a line of best fit, and a bar chart right below it.  
A `transform` is a way to change the data that we are visualizing. It's often something we can do with `pandas`, such as a join or a grouping, but we can also calculate regressions and apply filters that respond to user selections, which are hard to do with `pandas`.

In [None]:
base = alt.Chart(cars).mark_point().encode(
    x='Miles_per_Gallon',
    y='Horsepower',
    color='Origin:N'
)

# Fit Horsepower as a function of Miles_per_Gallon so the line matches the chart's axes
regression = base.transform_regression(
    'Miles_per_Gallon', 'Horsepower').mark_line().encode(
        color=alt.value('black')
    )

hist = alt.Chart(cars).mark_bar().encode(
    y='Origin',
    x='count()',
    color='Origin'
)

(base + regression) & hist

There are a lot of transforms out there - most of them can be easily done with `pandas`, so there isn't much point to rehashing the same thing here, but a regression transform is valuable to know, since it's tricky to do with `pandas`.
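
For intuition about what a regression transform fits, here is a plain-Python least-squares sketch. It assumes a straight-line fit, uses made-up data, and the helper `fit_line` is a hypothetical name; it illustrates the arithmetic and is not how Altair implements the transform.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up (miles per gallon, horsepower) pairs
mpg = [15.0, 20.0, 25.0, 30.0, 35.0]
hp = [165.0, 130.0, 110.0, 85.0, 70.0]

slope, intercept = fit_line(mpg, hp)
print(f'Horsepower ~ {slope:.2f} * MPG + {intercept:.2f}')
```

The first argument plays the role of the independent variable (the `x` axis) and the second the dependent variable (the `y` axis), so the fitted line only lands on the scatterplot if the two match the chart's encodings.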

## Recap So Far
We have learned about the declarative syntax of `altair`, which means we tell it what we want to do, instead of how to do it, as we do with `matplotlib`. We've learned how to make a bunch of different charts and graphs, and how to layer and concatenate different graphs easily. The one exception is pie charts, which `altair` could not do when these notes were written (newer releases add `mark_arc` for this).  
  
Now let's take a look at how to make interactive plots.

## Making `altair` plots interactive
At this point, we're able to do almost everything we can do with `matplotlib` and `seaborn`. We've seen that the grammar is kind of nice with `altair`, but at this point, there's no real point to switching. However, the real strength of `altair` is that it can be made interactive. This is the really killer part of this library.  
  
There are three basic selections that we can allow a user to do:
- selection interval
- multi selection
- single selection  

And there are four basic things you can do with these selections:
- Conditional encodings
- Scales
- Filters
- Domains  

We won't cover scales and domains in so much detail, since they're a bit trickier to deal with than conditions and filters, but we will touch on them. There are also some miscellaneous things we can do, such as adding a *tooltip* - a signal to the chart to show some data when the user hovers over a part of the chart, and a simple `.interactive()` command that allows panning and zooming.

## Selections
An interval selection is like a sliding window that the user can create and drag over the chart. When they do, this window fires off signals to the page about which points are inside it and which are not, which we can then use to change our plots. These sliding windows are often called *brushes*. We can also restrict the interval to a single axis.  
For example, this is a chart in which there is a brush (click and drag to experiment). There is also a tooltip that will show the name of the car when we hover over the data point.

In [None]:

selection = alt.selection_interval(translate=True, encodings=['x']) # translate allows us to drag the interval

alt.Chart(cars).mark_point().encode(
    x='Horsepower:Q',
    y='Miles_per_Gallon:Q',
    color='Origin',
    tooltip='Name'
).properties(
    selection=selection
)

This doesn't really mean much. However, we can use the signal fired off by the sliding window to add a conditional encoding - we can say, for example, that the points should only be colored by the origin if they are inside the interval. There are also a bunch of different options for the selection, some of which are below. `empty` allows us to specify which points should be colored when there is no interval, and zoom allows us to zoom the window in and out.

In [None]:
# translate allows us to drag the interval
selection = alt.selection_interval(translate=True, encodings=['x', 'y'],
                                   empty='none', zoom=True)

alt.Chart(cars).mark_point().encode(
    x='Horsepower:Q',
    y='Miles_per_Gallon:Q',
    color=alt.condition(selection, 'Origin', alt.value('lightgray')),
    tooltip='Name'
).properties(
    selection=selection
)

We don't have to use an interval selection - we can also use single selections. This allows us to select a single point and use that as a condition for conditional encoding. We can also specify what triggers the selection: a click, a mouseover, etc.  
  
One other fun thing we can do is make the graph interactive, which means we can pan and zoom, so our selection becomes a little more useful.

In [None]:
selection = alt.selection_single(empty='none', on='mouseover')

alt.Chart(cars).mark_circle(size=100).encode(
    x='Horsepower:Q',
    y='Miles_per_Gallon:Q',
    color=alt.condition(selection, 'Origin', alt.value('lightgray')),
    tooltip='Name'
).properties(
    selection=selection
).interactive()

Multi selection is almost exactly like single selection, except you can select multiple individual points by holding `shift` when you click to toggle points on and off.

In [None]:
selection = alt.selection_multi(empty='none', on='mouseover')
alt.Chart(cars).mark_circle(size=100).encode(
    x='Horsepower:Q',
    y='Miles_per_Gallon:Q',
    color=alt.condition(selection, 'Origin', alt.value('lightgray')),
).properties(
    selection=selection
)

The single and multi selections don't do a lot for us here, but they are particularly useful for bar charts, less so for scatterplots.  
  
Now that we know a little about selections, let's take a look at what we can do with them.
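
Before moving on, it may help to see the two most common uses, conditional encodings and filters, as ordinary predicates. In the plain-Python sketch below, an interval selection on `x` is modelled as a `(low, high)` pair. The function names and the points are made up for illustration; in a real chart the selection is evaluated in the browser, not in Python.

```python
# Made-up points; 'x' plays the role of Horsepower
points = [
    {'x': 70, 'Origin': 'Japan'},
    {'x': 95, 'Origin': 'Europe'},
    {'x': 150, 'Origin': 'USA'},
    {'x': 200, 'Origin': 'USA'},
]


def in_brush(point, brush):
    """True if the point lies inside the brush; brush is (low, high) or None."""
    if brush is None:
        return False  # mirrors empty='none': nothing counts as selected
    low, high = brush
    return low <= point['x'] <= high


def conditional_color(point, brush):
    """Like alt.condition(selection, 'Origin', alt.value('lightgray'))."""
    return point['Origin'] if in_brush(point, brush) else 'lightgray'


def filter_points(points, brush):
    """Like transform_filter(selection): keep only the selected rows."""
    return [p for p in points if in_brush(p, brush)]


brush = (90, 160)
colors = [conditional_color(p, brush) for p in points]
kept = filter_points(points, brush)
print(colors)
print(kept)
```

A conditional encoding keeps every point on the chart and only changes how it looks, while a filter removes the unselected rows entirely.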

## Scales
One cool thing with altair is that the graph itself can act as a filter! If you consider the axes of the graph, couldn't you create a selection that would fire when the graph is zoomed in and out?  
It turns out that this is what `interactive` does - it does something called a 'binding' - it 'binds' the scales on the plot to a selection, which you can then use as a selection **on different plots**!  
(This is also *time-series* data, which is kind of like a scatterplot over time)  
Examine this graph. We first create a time-series graph of vertical rules. Then, we create a new chart based on that where the scale is bound to a selection interval. Then, we simply add the selection interval to another, shorter version of the base chart.

In [None]:
interval = alt.selection_interval(encodings=['x'])
WIDTH = 800
weather = data.seattle_weather()
base = alt.Chart(weather).mark_rule().encode(
    x='date:T',
    y='temp_min:Q',
    y2='temp_max:Q',
    color='weather:N'
)

chart = base.properties(
    width=WIDTH,
    height=300
).encode(
    x=alt.X('date:T', scale=alt.Scale(domain=interval.ref()))
)

view = base.properties(
    width=WIDTH,
    height=50,
    selection=interval
)

chart & view

## Experiment #1: Single selection barplot filter
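At this point, the possibilities are endless. Let's look at an example where we use a selection made on a bar chart to filter the data shown in a scatterplot. Suppose we want to see data about the iris dataset, and we want to be able to highlight a particular species. The code uses a multi selection, so holding `shift` while clicking selects more than one species.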

In [None]:
# As a reminder of what the iris dataset looks like
iris.head()

In [None]:
bar_selector = alt.selection_multi(encodings=['color'])
scatter = alt.Chart(iris).mark_point().encode(
    x='sepalLength:Q',
    y='petalLength:Q',
    color='species:N'
).transform_filter(
    filter=bar_selector
).interactive()

bar = alt.Chart(iris).mark_bar().encode(
    y='species:N',
    x='mean(petalLength):Q',
    color=alt.condition(bar_selector, 'species', alt.value('lightgray'))
).properties(
    selection=bar_selector
)

scatter & bar

## Experiment #2: Interactive Legend
What if, instead of having a barplot, we could make a legend that was interactive? We'll basically need to think of the legend as a separate plot, then we can make it from scratch. We'll use some of the same code as last time:  
The only thing we will do differently is change the **scale** of the plot axes, since the last graph was sort of difficult to see. What this does is allow us to set the domains so that our data is in the middle of the plot, not off to one side. We do this by using `alt.Scale` and passing in a `domain` argument: a `tuple` of the minimum and maximum values.

In [None]:
legend_selector = alt.selection_multi(encodings=['color'])
scatter = alt.Chart(iris).mark_point().encode(
    x=alt.X('sepalLength:Q',
            scale=alt.Scale(domain=(4, 8))),
    y=alt.Y('sepalWidth:Q',
            scale=alt.Scale(domain=(1.75, 4.5))),
    color=alt.Color('species:N', legend=None),
    tooltip=['species:N']
).transform_filter(
    filter=legend_selector
).interactive()

legend = alt.Chart(iris).mark_point().encode(
    y='species:N',
    color=alt.condition(legend_selector, 'species', alt.value('lightgray'))
).properties(
    selection=legend_selector
)

(legend | scatter)
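
To see why the domain matters, here is a plain-Python sketch of a linear scale, which maps a data value to a pixel position along an axis. The helper `linear_scale` and the sample values are hypothetical and only illustrate the arithmetic.

```python
def linear_scale(value, domain, width):
    """Map a data value to a pixel offset along an axis of the given width."""
    low, high = domain
    return (value - low) / (high - low) * width


sepal_lengths = [4.3, 5.8, 7.9]  # made-up values in the iris range
width = 400

# Domain starting at zero: the data is squeezed into the right half of the axis
from_zero = [round(linear_scale(v, (0, 8), width)) for v in sepal_lengths]
# Domain fitted to the data: the same values spread across the whole axis
fitted = [round(linear_scale(v, (4, 8), width)) for v in sepal_lengths]

print(from_zero)
print(fitted)
```

With a domain of `(0, 8)` every point sits past the midpoint of a 400-pixel axis; with `(4, 8)` the same points use almost the full width.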

## Experiment #3: Multiple Linked Graphs and Filters
Let's take the previous graph, and just for the sake of experimentation (and fun), let's add **strip plots** of the `x` and `y` axes that will update based on what's selected. For this, we will have to break it down into multiple steps:
1. Decide our variables
2. Create our selectors
3. Create `x` and `y` stripplots using `mark_tick`
4. Add filters to stripplots
5. Add plot and legend from above
6. Add selection interval to scatterplot
7. Concatenate in the correct order

In [None]:
# Step 1, in case we want to change things later
x_variable = 'sepalLength:Q'
y_variable = 'sepalWidth:Q'
standard_dist = 400

# Step 2
strip_selector = alt.selection_interval(translate=True)
legend_selector = alt.selection_single(encodings=['color'])

# Step 3
x_stripplot = alt.Chart(iris).mark_tick().encode(
    y='species:N',
    x=alt.X(x_variable,
            scale=alt.Scale(domain=(4, 8))),
    color=alt.Color('species:N', legend=None)
).transform_filter( # Step 4
    filter=strip_selector
).transform_filter(
    filter=legend_selector
).properties(
    height=100,
    width=standard_dist
)

# We can also reuse plots and re-encode them!
y_stripplot = x_stripplot.encode(
    x='species:N',
    y=alt.Y(y_variable,
            scale=alt.Scale(domain=(1.75, 4.5)))
).properties(
    width=100,
    height=standard_dist
)

# Step 5
scatter = alt.Chart(iris).mark_point().encode(
    x=alt.X(x_variable,
            scale=alt.Scale(domain=(4, 8))),
    y=alt.Y(y_variable,
            scale=alt.Scale(domain=(1.75, 4.5))),
    # Combine selections with `&`; Python's `and` would silently drop legend_selector
    color=alt.condition(legend_selector & strip_selector, 'species', alt.value('lightgray')),
    tooltip=['species:N']
).transform_filter(
    filter=legend_selector
).properties(
    height=standard_dist,
    width=standard_dist,
    selection=strip_selector # Step 6
)

legend = alt.Chart(iris).mark_point().encode(
    y='species:N',
    color=alt.condition(legend_selector, 'species', alt.value('lightgray'))
).properties(
    selection=legend_selector
)

# Step 7
legend | ((scatter | y_stripplot) & x_stripplot)

Basically, we can add filters and selections however we want, wherever we want. It can get tricky to stop it from moving around, though.

## Maps
So far, we've been able to do pretty much everything we can do with `matplotlib`, except for spatial data. Fortunately, `altair` can do that, too. We use `mark_geoshape()`, and we have to pass it some sort of spatial data, which can be a `GeoDataFrame`. However, this takes forever, so we will use a TopoJSON file and load it with `alt.topo_feature` instead. The downside is that, if we want to do a join, we will have to use another transform, `transform_lookup`. This is basically the same thing as a join, so it's no trouble.  

In [None]:
countries = alt.topo_feature(data.world_110m.url, 'countries')
projections = ['equirectangular', 'mercator', 'orthographic', 'albers',
               'albersUsa', 'stereographic']
alt.Chart(countries).mark_geoshape().project(
    type=projections[0] # try different projections to see what happens!
).properties(
    width=500,
    height=300
)

## Experiment #4: In-depth combined time-series with map
Hopefully you've seen how we can do a bunch of different things with selections, maps, intervals, filters, and most of the stuff we can do with `matplotlib`. But now, how can we combine aspects from all of this to create a new map, one that links between spatial and tabular data, with interactive filters? What's the limit?  
  
For the sake of the example, let's suppose we're studying earthquakes, and we want an interactive chart to see what's been happening in the US with regard to earthquakes (you could totally do this with COVID by the way, and you don't have to do it with individual numbers either, but I digress). Our goal is to create a linked-brush, multi-panel time-series plot that is linked to a map, so we can kind of see what's going on over time in particular areas.  
Earthquake data retrieved from the [USGS geoJSON summary feed](https://earthquake.usgs.gov/earthquakes/feed/v1.0/geojson.php)  
  
From what we've seen, there are a few steps we need to follow, and some things we have to decide:
1. Get the data we need. This means we need the earthquake data with latitude, longitude, magnitude, and time.
2. Create a time-series plot of the data and add an x-encoded selection interval to it.
3. Create the base world map.
4. Plot the Earthquakes on top of the world map.
5. Filter the earthquakes to be only the ones that are in the selection interval
6. Since tooltips don't work very well with concatenated charts, we'll have to make our own, so we need to plot the magnitudes on the chart as well, then add a hover filter to that chart to only display the magnitude of the quake that we are hovering over.

First, we'll fetch the data. I'm going to use the world map from `vega_datasets` as spatial data, and I know that comes in `topoJSON` format, so I'll have to use `altair`'s `topo_feature` function to extract it. Next, I need quake data, which I will pull from the USGS earthquake feed. I will then restrict the columns of the earthquake data to only those we need.
After that, it's pretty straightforward - parse the time into a `datetime` using `pandas`, create a `mag_squared` column for plotting (more on this later; note that despite its name, the code raises the magnitude to the fourth power), and extract the latitude and longitude from the `geometry` column (more on this later too). Then, I'll drop any quakes whose magnitude is not greater than 0, because that would throw things off, and nobody really cares about quakes that small anyway.

In [None]:
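# World country outlines (TopoJSON) and one week of USGS earthquake data (GeoJSON)
states = alt.topo_feature(data.world_110m.url, 'countries')
quakes = gpd.read_file('https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_week.geojson')
# Keep only the columns we need; .copy() avoids a SettingWithCopyWarning below
quakes = quakes[['id', 'mag', 'time', 'geometry']].copy()
# USGS timestamps are milliseconds since the Unix epoch
quakes['time'] = pd.to_datetime(quakes['time'], unit='ms')
# Despite the name, this is mag ** 4, which exaggerates size differences on the map
quakes['mag_squared'] = np.power(quakes['mag'], 4)
quakes['latitude'] = quakes.geometry.y
quakes['longitude'] = quakes.geometry.x
# Drop quakes with zero, negative, or missing magnitude
quakes = quakes[quakes['mag'] > 0]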

Now, we can create our time series chart. It's pretty similar to what we've already done, but I'm going to rename the axis labels to something more readable. I can also reset the height and width, and add my selection interval. Also remember to save it in a variable, since we need to plot it with other stuff later.

In [None]:
selection = alt.selection_interval(encodings=['x'], zoom=False)

time_series = alt.Chart(quakes).mark_bar().encode(
    x=alt.X('time:T', axis=alt.Axis(title='Date and Time')),
    y=alt.Y('mag:Q', axis=alt.Axis(title='Magnitude')),
    color=alt.condition(selection, alt.value('steelblue'), alt.value('lightgray'))
).properties(
    width=1000,
    height=100,
    selection=selection
)

time_series

Next, we create our basic map using the `world_110m` dataset, and we can add lines to it using a `graticule`. We have to specify a projection, just like before, so I'm going to use `equirectangular`, since that's the one that's kind of the most standard-looking.

In [None]:
base_states = alt.Chart(states).mark_geoshape(
    fill='lightgray',
    stroke='white'
).properties(
    width=1000,
    height=470
).project('equirectangular').properties(
    title='Earthquakes of the World in the Last Week'
)

# use alt.graticule() to give us latitude and longitude lines
lines = alt.Chart(alt.graticule()).mark_geoshape(
    stroke='lightgray',
    strokeWidth=0.5
)

base_states + lines

Next, we create our quake chart. In this case, we're actually going to use `mark_circle`, since that will allow us to encode the `size`, but we'll just have to encode `latitude` and `longitude` too. We will also add a filter that comes from the time series, and add a selection that we will use to add our custom tooltip. Remember, passing `empty='none'` to our selection means that when there is no selection, nothing is shown.  
  
Also, I encode the size as `mag_squared`, which exaggerates the difference between earthquakes. The Richter scale is logarithmic: a magnitude 5 earthquake has ten times the measured amplitude of a magnitude 4 quake, so plotting a power of the magnitude shows the difference a little better than plotting the raw magnitude.
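
A quick arithmetic check makes the two scales concrete. The sketch below compares the amplitude ratio implied by a logarithmic magnitude scale with the ratio of plotted size values, using the fourth power from the data preparation code as the default. Both helper names are hypothetical.

```python
def amplitude_ratio(mag_a, mag_b):
    """How many times larger the measured amplitude of mag_a is than mag_b."""
    return 10 ** (mag_a - mag_b)


def size_ratio(mag_a, mag_b, exponent=4):
    """Ratio of plotted size values when size is encoded as mag ** exponent."""
    return (mag_a / mag_b) ** exponent


print(amplitude_ratio(5, 4))
print(size_ratio(5, 4))
print(size_ratio(5, 4, exponent=2))
```

Even with the fourth power, the plotted sizes understate the tenfold difference in amplitude, but they separate the quakes far more than the raw magnitudes would (a ratio of only 1.25).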

In [None]:
hover_point_selection = alt.selection_single(empty='none', on='mouseover')

quake_points = alt.Chart(quakes).mark_circle().encode(
    latitude='latitude:Q',
    longitude='longitude:Q',
    size=alt.Size('mag_squared:Q', legend=None),
    color=alt.value('steelblue')
).transform_filter(
    filter=selection
).properties(
    selection=hover_point_selection
)

Now we can add our tooltip chart by using `mark_text`, which is new, but it's pretty similar to everything else. We encode `latitude` and `longitude` just like the circles, but we also have to specify what text to show. We also pass a `dx` parameter to shift the label to the side; otherwise it would sit right on top of the quake.

In [None]:
quake_text = alt.Chart(quakes).mark_text(
    dx=20
).encode(
    latitude='latitude:Q',
    longitude='longitude:Q',
    text='mag:Q'
).transform_filter(
    filter=hover_point_selection
)

Now, all we have to do is put it all together. We use the shortcuts `+` and `&` to layer and vertically concatenate the charts.

In [None]:
(lines + base_states + quake_points + quake_text) & time_series
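
When `+` and `&` are mixed in one expression, parentheses around the layered part are optional because of Python's operator precedence: `+` binds tighter than `&`, which in turn binds tighter than `|`. The toy class below (a hypothetical stand-in, not an Altair object) records the order in which the operators are applied.

```python
class Node:
    """Toy stand-in for a chart that records how operators combine it."""

    def __init__(self, label):
        self.label = label

    def __add__(self, other):   # layering, like alt.layer
        return Node(f'layer({self.label}, {other.label})')

    def __and__(self, other):   # vertical concatenation, like alt.vconcat
        return Node(f'vconcat({self.label}, {other.label})')

    def __or__(self, other):    # horizontal concatenation, like alt.hconcat
        return Node(f'hconcat({self.label}, {other.label})')


a, b, c, d = (Node(name) for name in 'abcd')

# '+' is applied first, then '&', then '|'
print((a + b & c).label)
print((a | b & c + d).label)
```

Anything that should be grouped differently, such as concatenating two charts side by side before stacking a third underneath, needs explicit parentheses.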

## In Review
Altair is awesome for simply creating graphs, and has a very simple grammar of interaction so that you can combine multiple graphs and selections into truly impressive visualizations.  
I hope this was informative, thanks for reading!