# Introduction to Altair (and working with Jupyter Notebooks)

This is an interactive tutorial to get you used to working with notebooks and to teach you a bit about visualizing data with Altair. 

Initially, you will be provided with code, and you just need to run the cell (Shift-Enter or see the Run menu for more options). Later, you will need to provide some code of your own.

---
We will start with some standard declarations to import the libraries we need. In addition to the `altair` library, we will be using `numpy` (numerical python) and `pandas` (python data analysis library) to help manage the data before we hand it off to Altair. Nothing visible will happen when you run the cell, but make sure you run it anyway (otherwise nothing else in the notebook will run).

*Incidentally, I am using the `import XYZ as X` abbreviations because they are fairly standard, even if I do think they are the product of lazy typists...*

In [None]:
import altair as alt
import numpy as np
import pandas as pd

Next, we'll create some random data using numpy.

In [None]:
# ten random integers each, from 0 up to (but not including) 25
x = np.random.randint(0, 25, 10)
y = np.random.randint(0, 25, 10)

print('x:', x)
print('y:', y)

Altair is designed to work best with pandas DataFrames in "tidy" format, the long form that we talked about (every variable is a column, every observation is a row). [Pandas](http://pandas.pydata.org/pandas-docs/stable/) provides a lot of tools for manipulating data. You will pick some of them up as we go along, but I encourage you to consult the documentation when you find yourself needing to make changes to your data.

*Note that putting a variable on its own on the last line of a cell displays its value, just as it would in the normal Python shell.*

In [None]:
df = pd.DataFrame({'x':x, 'y':y})
df

In this instance, we could print the whole DataFrame out, but usually you will want to use `df.head()` so you don't get the entire data set. Change the number of random items from 10 to 100, and then use `df.head()` to see just the first five rows. *You can always make changes to cells and re-run them. Just be careful about the downstream cells. The state of the variables is based on execution order, not document order. So if, for example, you had reassigned `x` in a later cell and then re-ran the cell above, it would pick up the new value.*

One thing to note is that the DataFrame has added an extra label on the left of each row, called the `index`, which is basically just the row number in this case. While not immediately important, it is good to know it is there.
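To make the "tidy" idea from earlier a bit more concrete, here is a minimal sketch of reshaping a wide table into long form with pandas' `melt`. The table, its column names, and its values are all made up for illustration; they are not part of our data.

```python
import pandas as pd

# Hypothetical "wide" table: one row per city, one column per year.
wide = pd.DataFrame({
    "city": ["Avon", "Brill"],
    "2017": [3, 5],
    "2018": [4, 6],
})

# Tidy (long) form: every variable is a column, every observation is a row.
tidy = wide.melt(id_vars="city", var_name="year", value_name="sightings")
print(tidy)
```

Each row of `tidy` is now a single observation (one city in one year), which is the shape Altair prefers.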

## Making a visualization
---

Okay, time to make a visualization. The process is very much like the one we described previously (though backwards): create a chart, set its mark type, and then configure the encodings. We will start by creating our chart and setting the mark to point.

In [None]:
alt.Chart(df).mark_point()

This is not very interesting because we didn't set any encodings, so all of the points are drawn on top of one another.

In [None]:
alt.Chart(df).mark_point().encode(x='x')

This tells Altair to map our variable `x` to the x position. Now, we'll add `y`.

In [None]:
alt.Chart(df).mark_point().encode(x='x', y='y')

And we have a scatterplot. There are other encodings such as `size`, `color`, and `shape`, as well as other marks like `bar`, `tick`, and `line`.

In the cell below, I've added a third variable to our data. Create a new chart that maps `z` to the `size` encoding to create a bubble plot.

In [None]:
# match the length of x and y, in case you changed it above
z = np.random.randint(0, 25, len(x))

df = pd.DataFrame({'x': x, 'y': y, 'z': z})

# replace with your chart

Now try it again, mapping `z` to `color`.

In [None]:
# replace with your chart

## Nominal data and aggregation
---

Now, we will make another dataset with some nominal data in it. Let's say we happened to keep running into creatures from Doctor Who, and every time we counted how many of them we encountered.

In [None]:
types = ['Dalek', 'Cyberman', 'Ice Warrior']

# ten encounters: a randomly chosen creature type and a random count (1-9) for each
observed_types = [types[i] for i in np.random.randint(0, len(types), 10)]
counts = np.random.randint(1, 10, 10)

df = pd.DataFrame({'type': observed_types, 'number': counts})
df

Let's make a bar chart to look at the total number of each type of creature we encountered. 

In [None]:
alt.Chart(df).mark_bar().encode(x='type:N', y='sum(number)')

Notice that I added some information to the encodings. First, I added `:N` to `type`. Altair can usually guess the type of your data, but we can be explicit and tell Altair that we have nominal (N), quantitative (Q), ordinal (O), or temporal (T) data.

The other thing I added in was an **aggregation operator**: `sum()`. This aggregated the y values, grouped by the x values (in this case, it added up the counts for each type of creature).
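If it helps to see what that aggregation computes, a rough pandas equivalent is a group-by followed by a sum. This is only a sketch on a small hand-made table (the name `sightings` and its values are invented so the result is predictable); Altair does this work for you inside the chart.

```python
import pandas as pd

# Made-up encounters, fixed rather than random so the totals are easy to check.
sightings = pd.DataFrame({
    "type": ["Dalek", "Cyberman", "Dalek", "Ice Warrior", "Cyberman"],
    "number": [3, 2, 5, 1, 4],
})

# Group the rows by creature type, then add up the counts within each group.
totals = sightings.groupby("type")["number"].sum()
print(totals)
```

Here the two Dalek rows (3 and 5) collapse into a single total of 8, which is the height the Dalek bar would have.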

The vertical bar chart doesn't look great. Flip the encodings to make this a horizontal bar chart. Also, let's see the average (use `average()`) of the counts instead of the sums. 

In [None]:
# replace with your chart

## Customizing
---

We can customize some of the visual attributes of our marks that we are not tying to an encoding.

*Altair uses [web colors](https://en.wikipedia.org/wiki/Web_colors), which can be given by name or as hex codes.*

In [None]:
alt.Chart(df).mark_bar(color="darkslateblue").encode(x='type:N', y='sum(number)')

So far we have been using the short form of the encodings, where we just pass a string. There is also a long form, where we create an encoding object, which allows us to be more explicit and to exert more control.

In [None]:
alt.Chart(df).mark_bar(color="darkslateblue").encode(
    x=alt.X('type', type="nominal"), 
    y=alt.Y('number', type="quantitative", aggregate="sum")
)

We can then specify attributes of the axis as well.

In [None]:
alt.Chart(df).mark_bar(color="darkslateblue").encode(
    x=alt.X('type', type="nominal", axis=alt.Axis(title="Creature Type")), 
    y=alt.Y('number', type="quantitative", aggregate="sum", axis=alt.Axis(title="Number of Creatures"))
)

And we can also set some basic properties on the chart itself.

In [None]:
alt.Chart(df).mark_bar(color="cadetblue").encode(
    x=alt.X('type', type="nominal", axis=alt.Axis(title="Creature Type")), 
    y=alt.Y('number', type="quantitative", aggregate="sum", axis=alt.Axis(title="Number of Creatures"))
).properties(
    width=350,
    height=150,
    title="Creature Encounters"
)

## Importing data
---

We can use pandas to load in data in other formats. We will use it to pull in the Doctor Who episode data that we have been using as an example in class.

The `read_csv` function also works with local files. If this file were in the same directory as the notebook, we could just use `pd.read_csv('dr_who.csv')`. *How cool is it that we can just suck in data from the web?*

In [None]:
df = pd.read_csv("http://www.cs.middlebury.edu/~candrews/classes/cs465-f18/data/dr_who.csv")
df
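After loading a file, it is worth checking which columns you got and what type pandas assigned to each, since Altair bases its guesses on those types. The sketch below reads a tiny made-up CSV from a string instead of the real file; the column names and values are placeholders and may not match `dr_who.csv`.

```python
import io

import pandas as pd

# A tiny stand-in CSV; the columns and values here are hypothetical.
csv_text = """doctor,actor,episodes
1,Actor A,10
2,Actor B,12
"""

# read_csv accepts a path, a URL, or a file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.columns.tolist())  # the column names
print(sample.dtypes)            # the type pandas inferred for each column
```

Note that a column of whole numbers is read as an integer type, even when the numbers are really just labels.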

We will learn some more features of Altair shortly, but you are strongly encouraged to read through the [Altair Documentation](https://altair-viz.github.io/index.html) later, and look through the [Example Gallery](https://altair-viz.github.io/gallery/index.html) (particularly the Simple Charts, Bar Charts, Line Charts, and Scatter Plots).


## Your turn
---

The `df` DataFrame is all loaded up with Doctor Who data. You are going to create a couple of charts for me. Don't just give me the simplest defaults. Customize them a bit with good axis labels and titles, and think about orientation, sizing, and color as well.

Create a horizontal bar chart showing the total duration of each doctor (use the actor's name rather than the doctor number). *Challenge: order the bars by duration*

In [None]:
# your chart here

I said above that Altair can usually guess the type of your data, but it can't always guess right. The `doctor` variable is an example of this. Create a bar chart of doctor to episode, forcing `doctor` to be **nominal**.

In [None]:
# your chart here

Repeat your chart, but force it to be **quantitative**.

In [None]:
# your chart here

Which one should it be and why?

*double click and replace with your answer*

Graph the number of companions by start year as a line chart.

In [None]:
# your chart here