# Visualization in Altair 

As you have already seen in the lab, Altair is one of the more recent Python visualization libraries.

While `matplotlib` is one of the most popular visualization libraries (out of the whole [universe of alternatives](https://speakerdeck.com/jakevdp/pythons-visualization-landscape-pycon-2017?slide=36)), in this course we will be using Altair, because it allows us to focus on _what_ we would like to do, instead of _how_ to do it (the beauty of its [declarative nature](https://speakerdeck.com/jakevdp/pythons-visualization-landscape-pycon-2017?slide=80)).

Altair is built on Vega-Lite, which provides specifications for how to render interactive graphics in the browser. (Read more about it: [Vega-Lite: A Grammar of Interactive Graphics](https://eitanlees.github.io/altair-stack/)).

## Goals for this session

Use Altair to 
* graph a simple function
* graph more than one function on the same plot
* use data type encodings 
    * review why type encoding is important
    * use shorthand and long-form
* customize a chart by changing
    * axes labels
    * font sizes
    * chart title
    * legend
* visualize a real-world dataset

In [None]:
import altair as alt
import numpy as np
import pandas as pd

## Getting / creating a dataframe to visualize

Altair is able to work with [different types of objects](https://altair-viz.github.io/user_guide/data.html). The most common type, and the one that makes the encoding relatively straightforward, is a Pandas DataFrame.

In order to create an Altair chart, we usually would start by getting or creating a DataFrame. In this example, let's 
* generate integer values we'll call `i` and store them in an array `num`,
* generate the `f(i)` values for the `sin` function.

In [None]:
num = np.arange(100) # range of x values

sample_data = pd.DataFrame({
  'i': num,
  'f(i)': np.sin(num / 10)
})

Note that `i` and `f(i)` are the column names in the DataFrame called `sample_data`. The array to the right of each colon `:` stores the values in the corresponding column (one value per row).
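
As a quick sanity check, we can rebuild the same DataFrame and look at its column names, shape, and first few rows. This is only a small sketch that repeats the cell above so it can run on its own.

```python
import numpy as np
import pandas as pd

num = np.arange(100)  # integers 0..99

sample_data = pd.DataFrame({
    'i': num,
    'f(i)': np.sin(num / 10),
})

# Each dictionary key became a column name;
# each array became the values of that column.
print(sample_data.columns.tolist())
print(sample_data.shape)
print(sample_data.head(3))
```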

## Creating a chart

Now, we can use the new DataFrame that we created (`sample_data`) as the input to Altair.

The structure for the visualization is always the same: it starts with `alt.Chart`, with the DataFrame that we want to visualize given as the input, e.g., `alt.Chart(data)`.

Next, we need to decide what kind of visualization we want to create. The options are accessible using the `mark_*` methods. Some of the options include
* `mark_point()`: scatter plot
* `mark_line()`: line chart
* `mark_bar()`: bar chart / histogram

See the [Example Gallery](https://altair-viz.github.io/gallery/index.html) for a selection of examples.

To create our visualization, we need to map Altair's _encoding channels_ (_channels_ for short) to the columns in the dataset. The `encode()` method builds a **key-value mapping** between Altair's encoding channels (such as `x`, `y`, `color`, `shape`, `size`, etc.) and columns in the dataset, accessed by the **column name**.

In our example, we could **encode** the `i` values of the `sample_data` with the `x` channel of the `Chart`, which represents the x-axis position of the points.  Similarly, if we create a line chart, we can map the `f(i)` values to the `y` channel of the `Chart`.

In [None]:
alt.Chart(sample_data).mark_line().encode(
    x = 'i',
    y = 'f(i)'
)

Notice that the axes automatically inherited the dataframe's column names.

## Type of encodings

Note that grid lines and appropriate axis titles are automatically added to the resulting chart above. 

As described in the [Altair documentation](https://altair-viz.github.io/getting_started/starting.html#encodings-and-marks), when using "pandas dataframes, Altair automatically determines the appropriate data type for the mapped column". In our case, it inferred that `x` and `y` channels are both _quantitative_ type (i.e. real-valued).

Sometimes it is necessary to manually specify the encoding type. **"If types are not specified for data input as a DataFrame, Altair defaults to `quantitative` for any numeric data, `temporal` for date/time data, and `nominal` for string data, but be aware that these defaults are by no means always the correct choice!"** (Source: [Encoding Data Types](https://altair-viz.github.io/user_guide/encoding.html#encoding-data-types).)


Specifying the correct type for your data is crucial because it will affect how Altair represents your encoding in the resulting plot. For example, it would [affect the resulting color scales](https://altair-viz.github.io/user_guide/encoding.html#effect-of-data-type-on-color-scales) or [axis scales](https://altair-viz.github.io/user_guide/encoding.html#effect-of-data-type-on-axis-scales).

Altair allows you to control various [Encoding Channel Options](https://altair-viz.github.io/user_guide/encoding.html#encoding-channel-options).

The details of any mapping depend on the <em>type</em> of the data. Altair recognizes
[the following data types](https://altair-viz.github.io/user_guide/encoding.html#encoding-data-types):

<table border="1" class="docutils">
<colgroup>
<col width="16%" />
<col width="19%" />
<col width="65%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Data Type</th>
<th class="head">Shorthand Code</th>
<th class="head">Description</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td>quantitative</td>
<td><code class="docutils literal"><span class="pre">Q</span></code></td>
<td>a continuous real-valued quantity</td>
</tr>
<tr class="row-odd"><td>ordinal</td>
<td><code class="docutils literal"><span class="pre">O</span></code></td>
<td>a discrete ordered quantity</td>
</tr>
<tr class="row-even"><td>nominal</td>
<td><code class="docutils literal"><span class="pre">N</span></code></td>
<td>a discrete unordered category</td>
</tr>
<tr class="row-odd"><td>temporal</td>
<td><code class="docutils literal"><span class="pre">T</span></code></td>
<td>a time or date value</td>
</tr>
<tr class="row-even"><td>geojson</td>
<td><code class="docutils literal"><span class="pre">G</span></code></td>
<td>a geographic shape</td>
</tr>
</tbody>
</table>



The types can either be expressed in a **long-form** using the _channel encoding classes_ such as `X` and `Y`, or in **short-form** using the [Shorthand Syntax](https://altair-viz.github.io/user_guide/encoding.html#shorthand-description) discussed below.

To use Altair's shorthand syntax, we need to use the **Shorthand Code** shown above and include it after the colon `:` in the DataFrame column specification during the encoding (as shown below).

In [None]:
alt.Chart(sample_data).mark_line().encode(
    x = 'i:Q',
    y = 'f(i):Q'
)

The shorthand is equivalent to spelling out the attributes by name in the **long-form**. The code below produces the same result, but **note the differences in syntax**: capitalized `X` and `Y` with `alt.` in front of them, and parentheses instead of the assignment statement:

In [None]:
alt.Chart(sample_data).mark_line().encode(
    alt.X('i', type='quantitative'),
    alt.Y('f(i)', type='quantitative')
)

As mentioned in the [documentation](https://altair-viz.github.io/user_guide/encoding.html#encoding-data-types), the shorthand form "is useful for its lack of boilerplate when doing quick data explorations. The long-form, `alt.X('name', type='quantitative')`, is useful when doing more fine-tuned adjustments to the encoding, such as binning, axis and scale properties, or more."


## Customizing a chart

For some reason, in most tools, the default text in visualizations is small and hard to read. 

Whenever you pick up a new visualization tool, we recommend immediately learning how to do the following minimum steps:
1. how to create/change axis labels
1. how to change (increase) the font size
1. how to include the title of the visualization
1. how to specify colors / color schemes in the chart

Let's learn how to do each of these in Altair.


### Rename axis label

Let's first rename the axis labels, since `i` is extra hard to see. We can manipulate each axis individually by using `alt.X` and `alt.Y`. Note that we need to provide the _column name_ from the dataframe and then the corresponding `title` that needs to be displayed instead.

In [None]:
chart1 = alt.Chart(sample_data).mark_line().encode(
    alt.X('i', title='x'),
    alt.Y('f(i)', title='sin(x/10)')
)
chart1 # now we can refer to this chart and change its properties later

### Change the font size

Notice that in the above visualization the title of each axis and the axis labels (the numbers for the grid) are quite small. In Altair, we can separately adjust the size of each font.

In [None]:
# by reassigning chart1, the configuration for this chart is saved
# nothing will be displayed after the assignment statement
chart1 = chart1.configure_axis(
    labelFontSize=14, # change axes label font size
    titleFontSize=16  # change axes title font size
)

chart1 # to actually show the resulting visualization

**Aside**: We could also have generated the above data using Altair's [Sequence Generator](https://altair-viz.github.io/user_guide/data.html#generated-data).

## Detour: plotting more than one function to compare scales

In [None]:
num = np.arange(1, 1000) # range of x values

sample_data = pd.DataFrame({
    'x': num,
    'lin': num,
    'log(x)': np.log(num),
    'exp': np.array(2**(num/100))
})

sample_data

In [None]:
alt.Chart(sample_data).mark_line().encode(
    x = 'x',
    y = 'exp'
)

In [None]:
# https://altair-viz.github.io/user_guide/data.html#long-form-vs-wide-form-data
reshaped_data = sample_data.melt('x')
reshaped_data

Notice that `melt` automatically renames the resulting columns into `'variable'` and `'value'`.
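
To make the reshaping easier to follow, here is a tiny sketch with a made-up two-row table (`wide`). It shows the default column names and how `var_name` and `value_name` can be used to choose our own.

```python
import pandas as pd

# Hypothetical wide-form data: one column per function.
wide = pd.DataFrame({
    'x': [1, 2],
    'lin': [1, 2],
    'exp': [2.0, 4.0],
})

# Long-form: one row per (x, function) pair, with default column names.
long_default = wide.melt('x')
print(long_default)

# The same reshaping, but with column names of our choosing.
long_named = wide.melt(id_vars='x', var_name='function', value_name='f(x)')
print(long_named)
```

With custom names, the encodings would refer to `'function'` and `'f(x)'` instead of `'variable'` and `'value'`.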

In [None]:
alt.Chart(reshaped_data).mark_line().encode(
    x = 'x',
    y = 'value',
    color = 'variable'
)

Let's make our chart square.

In [None]:
alt.Chart(reshaped_data).mark_line().encode(
    x = 'x',
    y = 'value',
    color = 'variable'
).properties(
    width=400,
    height=400
)


A couple of helpful references:
[How to Read a Logarithmic Scale](https://www.wikihow.com/Read-a-Logarithmic-Scale)
and
[When Should I Use Logarithmic Scales in My Charts and Graphs?](https://www.forbes.com/sites/naomirobbins/2012/01/19/when-should-i-use-logarithmic-scales-in-my-charts-and-graphs/#a315cbc5e67b).
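
As a quick numeric illustration of what a logarithmic axis does, the sketch below takes values that each grow by a factor of 10 and computes where they would land on a base-10 log scale. Equal ratios end up equally spaced.

```python
import numpy as np

# Each value is 10 times the previous one.
values = np.array([1, 10, 100, 1000])

# Positions of these values along a base-10 logarithmic axis.
positions = np.log10(values)
print(positions)

# The gaps between neighbouring positions are all the same.
print(np.diff(positions))
```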

In [None]:
alt.Chart(reshaped_data).mark_line().encode(
    alt.X('x', scale=alt.Scale(type='log')),
    alt.Y('value', title='f(x)'),
    color = 'variable'
)

Let's make a [log-log plot](https://en.wikipedia.org/wiki/Log%E2%80%93log_plot) that uses logarithmic scales on both the horizontal and vertical axes.

In [None]:
alt.Chart(reshaped_data).mark_line().encode(
    alt.X('x', scale=alt.Scale(type='log')),
    alt.Y('value', title='f(x)', scale=alt.Scale(type='log')),
    color = 'variable'
)

In [None]:
reshaped_data[reshaped_data['value'] <= 0]
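
Why look for values that are zero or negative? A logarithmic scale has no position for them, so they need to be left out before plotting on a log axis. The sketch below rebuilds the data and shows that exactly one row is affected: `log(1)` is 0. It also shows one possible alternative to filtering inside the chart, namely filtering the DataFrame with pandas first.

```python
import numpy as np
import pandas as pd

num = np.arange(1, 1000)
sample_data = pd.DataFrame({
    'x': num,
    'lin': num,
    'log(x)': np.log(num),
    'exp': 2 ** (num / 100),
})
reshaped_data = sample_data.melt('x')

# Rows that cannot be placed on a logarithmic axis.
nonpositive = reshaped_data[reshaped_data['value'] <= 0]
print(nonpositive)

# Alternative to filtering in the chart: keep only the positive values.
positive_only = reshaped_data[reshaped_data['value'] > 0]
print(len(reshaped_data), len(positive_only))
```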

In [None]:
alt.Chart(reshaped_data).transform_filter(
    alt.datum.value > 0  
).mark_line().encode(
    alt.X('x', scale=alt.Scale(type='log')),
    alt.Y('value', title='f(x)', scale=alt.Scale(type='log')),
    color = 'variable'
)

Notice that the legend label automatically inherited the column name `'variable'`. Let's change it to explain that the colors in the legend refer to the functions. To do so, we will need to change `color` and use `alt.Color` and its `legend` property.


In [None]:
alt.Chart(reshaped_data).transform_filter(
    alt.datum.value > 0  
).mark_line().encode(
    alt.X('x', scale=alt.Scale(type='log')),
    alt.Y('value', title='f(x)', scale=alt.Scale(type='log')),
    color = alt.Color('variable', legend=alt.Legend(title="Functions"))
)

### Add the chart title

Let's finalize our visualization by adding the title and updating the fonts.

In [None]:
alt.Chart(reshaped_data).transform_filter(
    alt.datum.value > 0  
).mark_line().encode(
    alt.X('x', scale=alt.Scale(type='log')),
    alt.Y('value', title='f(x)', scale=alt.Scale(type='log')),
    color = alt.Color('variable', legend=alt.Legend(title="Functions"))
).properties(
    title="Comparison of exponential, linear, and logarithmic functions."
).configure_title(fontSize=18).configure_axis(
    labelFontSize=14, # change axes label font size
    titleFontSize=16  # change axes title font size
)

### Customize the legend

Now we see that the legend also has the default (small) text size. Let's fix it and make it readable like the rest of our text.

We'll need to change `configure_`...

In [None]:
chart = alt.Chart(reshaped_data).transform_filter(
    alt.datum.value > 0  
).mark_line().encode(
    alt.X('x', scale=alt.Scale(type='log')),
    alt.Y('value', title='f(x)', scale=alt.Scale(type='log')),
    color = alt.Color('variable', legend=alt.Legend(title="Functions"))
).properties(
    title="A log–log plot of exponential, linear, and logarithmic functions."
).configure_title(fontSize=18).configure_axis(
    labelFontSize=14, # change axes label font size
    titleFontSize=16  # change axes title font size
).configure_legend(
    labelFontSize=13,
    titleFontSize=14
)
chart

Here's additional documentation for customizing the legend.

https://altair-viz.github.io/user_guide/configuration.html#legend-configuration

https://altair-viz.github.io/user_guide/customization.html#adjusting-the-legend

https://altair-viz.github.io/user_guide/encoding.html#sorting-legends



## Select a new dataset

In order to learn how to add a chart title and change the default colors of a visualization, let's switch to a more interesting dataset. 

Let's use the `vega_datasets` package to select a real-world dataset to visualize.

In [None]:
from vega_datasets import data
# https://github.com/altair-viz/vega_datasets

In [None]:
# We can get helpful information by taking a look at the
# help documentation
# help(data)

In [None]:
# Looks like local_data has a method defined called list_datasets()

datasets = data.list_datasets() # what datasets are available?
#datasets

In [None]:
# Unfortunately, most of the datasets don't have the description property
data.stocks.description

In [None]:
df = data.stocks()
#df = data('stocks')
df.head(125)

In [None]:
df.tail()

It looks like the data are grouped by `symbol` and, within each symbol, arranged in ascending order by `date`.

We can view the range of stock prices using a scatterplot.

In [None]:
stocks = alt.Chart(df).mark_point().encode(
    y = "symbol",
    x = "price"
).configure_axis(
    labelFontSize=14, # change axes label font size
    titleFontSize=16  # change axes title font size
)

stocks

In [None]:
stocks_line = alt.Chart(df).mark_line().encode(
    x = "date",
    y = "price",
    color = "symbol"
).configure_axis(
    labelFontSize=14, # change axes label font size
    titleFontSize=16  # change axes title font size
)

stocks_line

We can further differentiate the lines by adding `strokeDash='symbol'` as shown in this [Multi Series Line Chart](https://altair-viz.github.io/gallery/multi_series_line.html) example.

In [None]:
### Add the chart title

stocks_line = stocks_line.properties(
    title="Monthly stock prices between 2000 and 2010."
).configure_title(fontSize=18)

stocks_line

### Adjust colors

https://altair-viz.github.io/user_guide/customization.html#customizing-colors

Vega specification of color schemes (these are the scheme names that Altair accepts): https://vega.github.io/vega/docs/schemes/

# Group Activity

You'll be placed into breakout groups: remember the number for your group -- that will be the number you need to change in this spreadsheet:
https://docs.google.com/spreadsheets/d/1QNxYLtzBpZ5AaVAcC3rPzvpaoyxXB7Zwz7z97eY2MOA/edit?usp=sharing