# 00: Intro to Visualization in Python

## Resources

Python has several data visualization libraries, including:

- [Matplotlib](https://matplotlib.org)
- [Seaborn](https://seaborn.pydata.org)
- [Bokeh](https://docs.bokeh.org/en/latest/)
- [Altair](https://github.com/altair-viz/altair)
- [Plotly](https://plotly.com/python/)
- [Plotnine](https://plotnine.readthedocs.io/en/stable/)

If you're interested in building an interactive application or dashboard in Python for your project, then you may want to check out:

- [streamlit](https://streamlit.io)
- [Plotly Dash](https://dash.plotly.com/introduction)
- [Panel](https://panel.holoviz.org)
- [H2O Wave](https://wave.h2o.ai)

## Altair

If you are not familiar with any Python visualization libraries, this notebook provides an overview of Altair. This content is based on the [Altair documentation](https://altair-viz.github.io).

Altair is essentially a Python API for [Vega-Lite](https://vega.github.io/vega-lite/), which is a grammar for specifying interactive graphics in JSON. Altair has a straightforward API that focuses on specifying marks and channel encodings, which is a nice match to how we conceptually describe visualizations. Learning Altair will also make for an easy transition to using Vega-Lite elsewhere, such as in a web app. One downside of Altair is that it is cumbersome to create static visualizations outside of the browser. If your goal is to write a Python script that saves charts to an image file on your computer without using the browser, then Altair is not the best choice. In some cases, the Altair documentation is not very thorough and you may be better off referencing the Vega-Lite documentation instead.

### Data

You can pass your dataset to Altair as a pandas dataframe or as a path to a JSON or CSV file. Altair works best with data that is in [tidy](https://vita.had.co.nz/papers/tidy-data.pdf) (or [long](https://altair-viz.github.io/user_guide/data.html#long-form-vs-wide-form-data)) format. If you pass a pandas dataframe, then Altair will automatically infer data types, but it will include the entire dataset in the chart's specification. With large datasets, this can lead to large chart and notebook sizes.
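As a rough illustration of the tidy (long) format mentioned above, the sketch below reshapes a small wide table with `pandas.melt`. The column names (`city`, `jan`, `feb`, `rentals`) and the values are made up for this example and do not come from the datasets used in this notebook.

```python
import pandas as pd

# Hypothetical wide-form table: one row per city, one column per month.
wide = pd.DataFrame({
    'city': ['Austin', 'Boston'],
    'jan': [10, 20],
    'feb': [30, 40],
})

# Long (tidy) form: one row per (city, month) observation.
tidy = wide.melt(id_vars='city', var_name='month', value_name='rentals')
print(tidy)
```

In the long form, `month` can be mapped directly to a channel such as `x` or `color`, which is why Altair tends to work best with data in this shape.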

In [None]:
import altair as alt
import pandas as pd

In [None]:
# dataset is from https://observablehq.com/@d3/bar-chart
letter_frequencies = pd.DataFrame({
    'letter': ["A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P","Q","R","S","T","U","V","W","X","Y","Z"],
    'frequency': [0.08167,0.01492,0.02782,0.04253,0.12702,0.02288,0.02015,0.06094,0.06966,0.00153,0.00772,0.04025,0.02406,0.06749,0.07507,0.01929,0.00095,0.05987,0.06327,0.09056,0.02758,0.00978,0.0236,0.0015,0.01974,0.00074]
})

letter_frequencies

Here, we are passing the dataset as a pandas dataframe.

In [None]:
letter_freq_bar_chart = alt.Chart(letter_frequencies).mark_bar().encode(
    x='letter',
    y='frequency'
)

letter_freq_bar_chart

When we look at the Vega-Lite specification that Altair creates for this chart, we can see that the entire dataset is included. This makes the chart portable, but increases its size.

In [None]:
print(letter_freq_bar_chart.to_json())

We can avoid having the dataset included in the chart specification by saving the dataset to a JSON file and then passing the path to that file to Altair. **Note that this works locally, but not on Colab**.

However, when we do this, we have to specify the types of our data.

|Data Type    | Shorthand|
|-------------|----------|
|quantitative | Q        |
|ordinal      | O        |
|nominal      | N        |
|temporal     | T        |
|geojson      | G        |

In [None]:
# -p avoids an error if the directory already exists (e.g., when re-running the cell)
!mkdir -p data

letter_frequencies.to_json('data/letter_frequencies.json', orient='records')

letter_freq_bar_chart_2 = alt.Chart('data/letter_frequencies.json').mark_bar().encode(
    x='letter:N',
    y='frequency:Q'
)

letter_freq_bar_chart_2

Now we can see that the entire dataset is not included in the Vega-Lite specification.

In [None]:
print(letter_freq_bar_chart_2.to_json())

We can also have Altair save our dataframe to a file automatically. With this enabled, Altair will still infer data types and the chart specification won't include the dataset. Again, **note that this does not work on Colab**.

In [None]:
alt.data_transformers.enable('json', prefix='data/altair-data')

letter_freq_bar_chart_3 = alt.Chart(letter_frequencies).mark_bar().encode(
    x='letter',
    y='frequency'
)

letter_freq_bar_chart_3

In [None]:
print(letter_freq_bar_chart_3.to_json())

### Fundamentals

To create a chart in Altair, you specify the type of mark that you want to use and what you want the channels of the marks to encode. 

#### Marks

In the next few charts, we use the same encodings with different marks.

In [None]:
alt.Chart(letter_frequencies).mark_bar().encode(
    x='letter',
    y='frequency'
)

In [None]:
alt.Chart(letter_frequencies).mark_point().encode(
    x='letter',
    y='frequency'
)

In [None]:
alt.Chart(letter_frequencies).mark_square().encode(
    x='letter',
    y='frequency'
)

In [None]:
alt.Chart(letter_frequencies).mark_rule().encode(
    x='letter',
    y='frequency'
)

In [None]:
alt.Chart(letter_frequencies).mark_tick().encode(
    x='letter',
    y='frequency'
)

Another way that we could have done this is by specifying a base chart that has the encodings but no mark.

In [None]:
base = alt.Chart(letter_frequencies).encode(
    x='letter',
    y='frequency'
)

We then call the mark method that we want on the base chart.

In [None]:
base.mark_bar()

In [None]:
base.mark_point()

#### Encodings

If we want a horizontal bar chart instead of a vertical bar chart, then we can swap the x and y encodings.

In [None]:
alt.Chart(letter_frequencies).mark_bar().encode(
    y='letter',
    x='frequency'
)

For another example of using channels to encode data, we add a column to our dataframe to indicate whether or not the letter is a vowel and then color the bars accordingly.

In [None]:
letter_frequencies['is_vowel'] = letter_frequencies['letter'].isin(('A', 'E', 'I', 'O', 'U'))

In [None]:
alt.Chart(letter_frequencies).mark_bar().encode(
    x='letter',
    y='frequency',
    color='is_vowel'
)

#### Sorting

Up to this point, we've been using a shorthand way to specify the encodings. If you need greater control over the scale or axis, then you can use the long-form. For example, the long-form of `x='letter'` is `x=alt.X('letter')`. Through this long-form `alt.X()`, we can specify how to sort bars. Below, we sort the bars in alphabetical order, which is the order that the dataset was already in.

In [None]:
alt.Chart(letter_frequencies).mark_bar().encode(
    x=alt.X('letter').sort('ascending'),
    y='frequency',
)

Here we sort the bars in reverse alphabetical order.

In [None]:
alt.Chart(letter_frequencies).mark_bar().encode(
    x=alt.X('letter').sort('descending'),
    y='frequency',
)

We can also sort the bars according to another channel. For example, here we sort the bars in descending order by frequency.

In [None]:
alt.Chart(letter_frequencies).mark_bar().encode(
    x=alt.X('letter').sort('-y'),
    y='frequency',
)

#### Aggregation

Altair [supports](https://altair-viz.github.io/altair-tutorial/notebooks/03-Binning-and-aggregation.html) grouping, binning, and aggregating your data. For example, here we have a bar chart that shows the average frequency of vowels and consonants. The same approach applies for other aggregations, like min, max, median, q1 (first quartile), q3 (third quartile), count, stdev (standard deviation), etc.

In [None]:
alt.Chart(letter_frequencies).mark_bar().encode(
    x='average(frequency)',
    y='is_vowel'
)

In the long-form, the above chart would look like this.

In [None]:
alt.Chart(letter_frequencies).mark_bar().encode(
    x=alt.X('frequency').aggregate('average'),
    y='is_vowel'
)

The `count` aggregation behaves differently from the others in that it does not need a column. For example, here we count the number of vowels and consonants.

In [None]:
alt.Chart(letter_frequencies).mark_bar().encode(
    x='count()',
    y='is_vowel'
)
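To sanity-check the values behind aggregated charts like the ones above, the same grouping can be reproduced in pandas. The sketch below uses a small made-up stand-in for `letter_frequencies`, with only `A` and `E` treated as vowels, so the numbers are illustrative only.

```python
import pandas as pd

# Small made-up stand-in for the letter_frequencies dataframe.
df = pd.DataFrame({
    'letter': ['A', 'B', 'C', 'E'],
    'frequency': [0.08, 0.02, 0.04, 0.12],
})
df['is_vowel'] = df['letter'].isin(['A', 'E'])

# Counterpart of x='average(frequency)', y='is_vowel'
avg_by_group = df.groupby('is_vowel')['frequency'].mean()

# Counterpart of x='count()', y='is_vowel'
count_by_group = df.groupby('is_vowel').size()

print(avg_by_group)
print(count_by_group)
```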

For binning, we need to use the long-form. Here we have a histogram that bins the frequencies and shows the number of letters in each bin.

In [None]:
alt.Chart(letter_frequencies).mark_bar().encode(
    x=alt.X('frequency').bin(),
    y='count()'
)

We can use [.bin()](https://altair-viz.github.io/user_guide/generated/core/altair.BinParams.html) to get more control over the bins.

In [None]:
alt.Chart(letter_frequencies).mark_bar().encode(
    x=alt.X('frequency').bin(step=0.05),
    y='count()'
)

*Practice*

Below we have a dataset on daily bike rentals.

In [None]:
bike = pd.read_csv('https://raw.githubusercontent.com/christophM/interpretable-ml-book/master/data/bike.csv')
bike.head()

Make a bar chart that shows the median number of bikes rented ("cnt") for each weather situation ("weathersit"). Sort the bars from lowest to highest count.

Make a histogram for the number of bikes rented ("cnt").

### Basic Plots

#### Scatter plot

In [None]:
alt.Chart(bike).mark_point().encode(
    x='temp',
    y='hum'
)

In [None]:
alt.Chart(bike).mark_point().encode(
    x='temp',
    y='hum',
    color='season'
)

In [None]:
alt.Chart(bike).mark_point().encode(
    x='temp',
    y='hum',
    shape='season'
)

In [None]:
alt.Chart(bike).mark_circle().encode(
    x='temp',
    y='hum',
    color='cnt'
)

In [None]:
alt.Chart(bike).mark_point().encode(
    x='temp',
    y='hum',
    size='cnt'
)

#### Strip plot

We can explicitly set the sorted order for the seasons.

In [None]:
alt.Chart(bike).mark_tick().encode(
    x='temp',
    y=alt.Y('season').sort(['WINTER', 'SPRING', 'SUMMER', 'FALL']),
)

#### 2D Histogram

In [None]:
alt.Chart(bike).mark_circle().encode(
    x=alt.X('temp').bin(),
    y=alt.Y('hum').bin(),
    size='average(cnt)',
)

#### Adjacency Matrix

We can use `.properties()` to set top-level properties, like the width and height of the chart.

In [None]:
alt.Chart(bike).mark_rect().encode(
    x='season',
    y='weathersit',
    color='average(cnt)',
).properties(
    width=200,
    height=200
)

In [None]:
alt.Chart(bike).mark_circle().encode(
    x='season',
    y='weathersit',
    size='average(cnt)',
)

Note that changing the plot dimensions does not change the size of the circles automatically.

In [None]:
alt.Chart(bike).mark_circle().encode(
    x='season',
    y='weathersit',
    size='average(cnt)',
).properties(
    width=200,
    height=150
)

To do that, we can set the width and height of the chart based on the step of the scales. Now the chart is 200 pixels wide (4 * 50, for the four seasons) and 150 pixels tall (3 * 50, for the three weather situations).

In [None]:
alt.Chart(bike).mark_circle().encode(
    x='season',
    y='weathersit',
    size='average(cnt)',
).properties(
    width=alt.Step(50),
    height=alt.Step(50)
)

#### Line and area charts

In [None]:
alt.Chart(bike).mark_line().encode(
    x='days_since_2011',
    y='cnt'
)

For more complex aggregations, like rolling windows, you [can](https://altair-viz.github.io/user_guide/transform/window.html#user-guide-window-transform) compute them directly in Altair, but it is probably easier to do so in pandas.

In [None]:
bike_rolling_avg = bike.rolling(on='days_since_2011', window=7)['cnt'].mean().reset_index()
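To show what a rolling mean does, here is a minimal sketch on a few made-up numbers with a window of 3 instead of 7. Each value is the mean of the current row and the two rows before it, and the first entries are `NaN` until the window is full. By the same logic, the first six values of the 7-day average are `NaN`.

```python
import pandas as pd

# Made-up daily counts, just to show the mechanics of a rolling mean.
counts = pd.Series([10, 20, 30, 40, 50])

# Mean over the current row and the two rows before it.
# The first two entries are NaN because the window is not yet full.
rolling_avg = counts.rolling(window=3).mean()
print(rolling_avg)
```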

In [None]:
alt.Chart(bike_rolling_avg).mark_line().encode(
    x='index',
    y='cnt'
)

We can replace `mark_line` with `mark_area` to get an area plot.

In [None]:
alt.Chart(bike_rolling_avg).mark_area().encode(
    x='index',
    y='cnt'
)

In [None]:
months = bike['mnth'].unique()
months

For `mark_line`, adding a color encoding will create multiple lines. Note that when setting the encoding for color, we have to override the inferred type for year with `yr:N` so that each year is treated as a separate category.

In [None]:
alt.Chart(bike).mark_line().encode(
    x=alt.X('mnth').sort(months),
    y='median(cnt)',
    color='yr:N'
)

For `mark_area`, adding a color encoding will create stacked areas.

In [None]:
alt.Chart(bike).mark_area().encode(
    x=alt.X('mnth').sort(months),
    y='median(cnt)',
    color='yr:N'
)

We can use the order channel to specify how the layers are ordered.

In [None]:
alt.Chart(bike).mark_area().encode(
    x=alt.X('mnth').sort(months),
    y='median(cnt)',
    color='yr:N',
    order='yr'
)

We can use the stack property to make a normalized stacked area chart.

In [None]:
alt.Chart(bike).mark_area().encode(
    x=alt.X('mnth').sort(months),
    y=alt.Y('median(cnt)').stack('normalize'),
    color='yr:N'
)

#### Stacked Bar Chart

In [None]:
alt.Chart(bike).mark_bar().encode(
    y='season',
    x='count()',
    color='weathersit'
)

In [None]:
alt.Chart(bike).mark_bar().encode(
    y='season',
    x=alt.X('count()').stack('normalize'),
    color='weathersit'
)

### Facets

To create faceted charts or small multiples, we can use the `row` and `column` channels.

In [None]:
alt.Chart(bike).mark_bar().encode(
    y='weathersit',
    x='count()',
    row='season'
)

In [None]:
alt.Chart(bike).mark_bar().encode(
    x='weathersit',
    y='count()',
    column='season'
)

Note that for faceted charts, the width and height properties set the size of each individual panel, not the size of the whole figure.

In [None]:
alt.Chart(bike).mark_point().encode(
    x='temp',
    y='hum',
    column='season',
    row='weathersit'
).properties(
    width=150,
    height=150
)

In [None]:
alt.Chart(bike, width=150, height=150).mark_point().encode(
    x='temp',
    y='hum',
    column='season',
)

If we want to wrap a facet across multiple rows, then we can use the facet channel.

In [None]:
alt.Chart(bike).mark_point().encode(
    x='temp',
    y='hum',
    facet=alt.Facet('season').columns(2)
).properties(
    width=150,
    height=150,
)

### Concatenation and Layers

We can use the plus sign (+) to layer one chart over another.

In [None]:
base = alt.Chart(bike).encode(
    x=alt.X('mnth').sort(months),
    y='median(cnt)',
    color='yr:N'
)

In [None]:
base.mark_line()

In [None]:
base.mark_circle()

In [None]:
base.mark_line() + base.mark_circle()

*Practice*

Make a [lollipop chart](https://datavizproject.com/data-type/lollipop-chart/) that shows the median count for each month.

We can use the pipe (|) to horizontally concatenate charts.

In [None]:
base = alt.Chart(bike)

scatter = base.mark_point().encode(
    x='temp',
    y='hum'
)

bar = base.mark_bar().encode(
    y='count()',
    x='weathersit'
)

scatter | bar

We can use an ampersand (&) to vertically concatenate charts.

In [None]:
scatter & bar

We can use both together, such as to show marginal distributions.

In [None]:
base = alt.Chart(bike)

scatter = base.mark_point().encode(
    x='temp',
    y='hum'
)

right_ticks = base.mark_tick().encode(
    y=alt.Y('hum').axis(None),
    opacity=alt.value(0.2)
)

top_ticks = base.mark_tick(opacity=0.2).encode(
    x=alt.X('temp').axis(None)
)

top_ticks & (scatter | right_ticks)

### Customization

#### Dimensions

In [None]:
alt.Chart(bike).mark_point().encode(
    x='temp',
    y='hum'
).properties(
    width=500,
    height=500
)

#### Mark properties

In [None]:
alt.Chart(bike).mark_point(opacity=0.25).encode(
    x='temp',
    y='hum',
    color=alt.value('red')
)

#### Labels

In [None]:
alt.Chart(bike).mark_point().encode(
    x=alt.X('temp').title('Temperature'),
    y=alt.Y('hum').title('Humidity')
).properties(
    title='Temperature vs. Humidity'
)

#### Axes

In [None]:
alt.Chart(bike).mark_point().encode(
    x=alt.X('temp').axis(grid=False),
    y=alt.Y('hum').axis(format='.2f')
)

#### Scales

In [None]:
alt.Chart(bike).mark_bar().encode(
    x=alt.X('mnth').sort(months).scale(padding=0.5),
    y=alt.Y('average(cnt)').scale(type='log')
)

In [None]:
alt.Chart(bike).mark_point().encode(
    x=alt.X('temp').scale(nice=False),
    y=alt.Y('hum').scale(nice=False)
)

In [None]:
alt.Chart(bike).mark_line().encode(
    x=alt.X('mnth').sort(months),
    y=alt.Y('median(cnt)').scale(zero=False),
    color='yr:N'
)

Here are the included [color schemes](https://vega.github.io/vega/docs/schemes/#reference).

In [None]:
alt.Chart(bike).mark_circle().encode(
    x='temp',
    y='hum',
    color=alt.Color('cnt').scale(scheme='viridis')
)

In [None]:
alt.Chart(bike).mark_circle().encode(
    x='temp',
    y='hum',
    color=alt.Color('cnt').scale(scheme='brownbluegreen', domainMid=bike['cnt'].median(), reverse=True)
)

We can also specify our own custom color scheme.

In [None]:
alt.Chart(bike).mark_point().encode(
    x='temp',
    y='hum',
    color=alt.Color('season').scale(
        domain=['WINTER', 'SPRING', 'SUMMER', 'FALL'],
        range=['#264653', '#2A9D8F', '#E9C46A', '#F4A261']
    )
)

### Interaction

Basic interaction can be achieved with `.interactive()` and by setting a tooltip.

In [None]:
alt.Chart(bike).mark_point().encode(
    x='temp',
    y='hum',
    color='season',
    tooltip=['temp', 'hum', 'season']
).interactive()

We can also add brushing.

In [None]:
brush = alt.selection_interval()

alt.Chart(bike).mark_point().encode(
    x='temp',
    y='hum',
    color=alt.condition(brush, 'season', alt.value('#dddddd')),
).add_params(
    brush
)

In [None]:
brush = alt.selection_interval()

base = alt.Chart(bike).mark_point().encode(
    color=alt.condition(brush, 'season', alt.value('#dddddd')),
).add_params(
    brush
).properties(
    width=350,
    height=350
)

base.encode(x='temp', y='hum') | base.encode(x='windspeed', y='cnt')

Previously we had this static chart.

In [None]:
base = alt.Chart(bike)

scatter = base.mark_point().encode(
    x='temp',
    y='hum'
)

bar = base.mark_bar().encode(
    y='count()',
    x='weathersit'
)

scatter | bar

We can use brushing to make it interactive.

In [None]:
brush = alt.selection_interval()

scatter = alt.Chart(bike).mark_point().encode(
    x='temp',
    y='hum',
    color=alt.condition(brush, alt.value('steelblue'), alt.value('#dddddd'))
).add_params(brush)

bar = alt.Chart(bike).mark_bar().encode(
    y='count()',
    x='weathersit'
).transform_filter(
    brush
)

scatter | bar

### Practice

In [None]:
from vega_datasets import data
cars = data.cars()
cars.head()

What's the relationship between Miles_per_Gallon, Horsepower, and Cylinders?

How many cars are there from each origin?

How many cars are there from each origin for each number of cylinders?

...