# Tutorial 1: Introduction to Altair
**INF3842: Information Visualization and Visual Analytics**

This guide introduces the [Altair](https://altair-viz.github.io/) visualization library.

*Altair is a declarative statistical visualization library for Python, based on Vega and Vega-Lite, and the source is available on GitHub.*

Note: unlike other libraries such as Seaborn, Altair is **NOT** built on top of matplotlib.

In [1]:
import altair as alt
import pandas as pd
import numpy as np

First, let's build a basic chart with Altair.

In [2]:
from vega_datasets import data
cars = data.cars()

We loaded a Vega dataset, in this case a dataset about cars.

In [3]:
cars

Unnamed: 0,Name,Miles_per_Gallon,Cylinders,Displacement,Horsepower,Weight_in_lbs,Acceleration,Year,Origin
0,chevrolet chevelle malibu,18.0,8,307.0,130.0,3504,12.0,1970-01-01,USA
1,buick skylark 320,15.0,8,350.0,165.0,3693,11.5,1970-01-01,USA
2,plymouth satellite,18.0,8,318.0,150.0,3436,11.0,1970-01-01,USA
3,amc rebel sst,16.0,8,304.0,150.0,3433,12.0,1970-01-01,USA
4,ford torino,17.0,8,302.0,140.0,3449,10.5,1970-01-01,USA
...,...,...,...,...,...,...,...,...,...
401,ford mustang gl,27.0,4,140.0,86.0,2790,15.6,1982-01-01,USA
402,vw pickup,44.0,4,97.0,52.0,2130,24.6,1982-01-01,Europe
403,dodge rampage,32.0,4,135.0,84.0,2295,11.6,1982-01-01,USA
404,ford ranger,28.0,4,120.0,79.0,2625,18.6,1982-01-01,USA


The data comes as a pandas DataFrame.

In [4]:
type(cars)

pandas.core.frame.DataFrame

Let's make a scatterplot that plots a car's horsepower against its fuel efficiency in miles per gallon.

In [5]:
alt.Chart(cars).mark_point().encode(
    x='Horsepower',
    y='Miles_per_Gallon',
    color='Origin',
)

We can recognize two concepts covered in the course: marks and channels.

* The mark used is the point (`mark_point()`)
* There are two spatial channels, the x axis and the y axis, on which we place horsepower and fuel efficiency, respectively
* Color encodes the cars' place of origin

# Marks

Altair offers several marks. Let's start with the basic ones.

**Arcs**

In [6]:
source = pd.DataFrame({"category": [1, 2, 3, 4, 5, 6], "value": [4, 6, 10, 3, 7, 8]})

alt.Chart(source).mark_arc().encode(
    theta=alt.Theta(field="value", type="quantitative"),
    color=alt.Color(field="category", type="nominal"),
)

**Areas**

In [7]:
source = data.iowa_electricity()

alt.Chart(source).mark_area().encode(
    x="year:T",
    y="net_generation:Q",
    color="source:N"
)

In [8]:
source

Unnamed: 0,year,source,net_generation
0,2001-01-01,Fossil Fuels,35361
1,2002-01-01,Fossil Fuels,35991
2,2003-01-01,Fossil Fuels,36234
3,2004-01-01,Fossil Fuels,36205
4,2005-01-01,Fossil Fuels,36883
5,2006-01-01,Fossil Fuels,37014
6,2007-01-01,Fossil Fuels,41389
7,2008-01-01,Fossil Fuels,42734
8,2009-01-01,Fossil Fuels,38620
9,2010-01-01,Fossil Fuels,42750


**Bars**

In [None]:
source = pd.DataFrame({
    'a': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'],
    'b': [28, 55, 43, 91, 81, 53, 19, 87, 52]
})

alt.Chart(source).mark_bar().encode(
    x='a',
    y='b'
)

**Lines**

In [9]:
x = np.arange(100)

source = pd.DataFrame({
  'x': x,
  'f(x)': np.sin(x / 5)
})

alt.Chart(source).mark_line().encode(
    x='x',
    y='f(x)'
)

**Rectangular areas**

In [10]:
# Rectangular areas

x, y = np.meshgrid(range(-5, 5), range(-5, 5))
z = x ** 2 + y ** 2

# Convert this grid to columnar data expected by Altair
source = pd.DataFrame({'x': x.ravel(),
                     'y': y.ravel(),
                     'z': z.ravel()})

alt.Chart(source).mark_rect().encode(
    x='x:O',
    y='y:O',
    color='z:Q'
)

In [15]:
source

Unnamed: 0,x,y,z
0,-5,-5,50
1,-4,-5,41
2,-3,-5,34
3,-2,-5,29
4,-1,-5,26
...,...,...,...
95,0,4,16
96,1,4,17
97,2,4,20
98,3,4,25


In [12]:
y

array([[-5, -5, -5, -5, -5, -5, -5, -5, -5, -5],
       [-4, -4, -4, -4, -4, -4, -4, -4, -4, -4],
       [-3, -3, -3, -3, -3, -3, -3, -3, -3, -3],
       [-2, -2, -2, -2, -2, -2, -2, -2, -2, -2],
       [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
       [ 2,  2,  2,  2,  2,  2,  2,  2,  2,  2],
       [ 3,  3,  3,  3,  3,  3,  3,  3,  3,  3],
       [ 4,  4,  4,  4,  4,  4,  4,  4,  4,  4]])

## Composite marks

**Boxplot**

In [19]:
import altair as alt
from vega_datasets import data

source = data.population()

alt.Chart(source).mark_boxplot(extent='min-max').encode(
    x='age:O',
    y='people:Q'
)

In [20]:
source

Unnamed: 0,year,age,sex,people
0,1850,0,1,1483789
1,1850,0,2,1450376
2,1850,5,1,1411067
3,1850,5,2,1359668
4,1850,10,1,1260099
...,...,...,...,...
565,2000,80,2,3221898
566,2000,85,1,970357
567,2000,85,2,1981156
568,2000,90,1,336303


**Error band**

In [21]:
source = data.cars()

line = alt.Chart(source).mark_line().encode(
    x='Year',
    y='mean(Miles_per_Gallon)'
)

band = alt.Chart(source).mark_errorband(extent='ci').encode(
    x='Year',
    y=alt.Y('Miles_per_Gallon', title='Miles/Gallon'),
)

band + line

**Error bars**

In [22]:
source = data.barley()

error_bars = alt.Chart(source).mark_errorbar(extent='ci').encode(
  x=alt.X('yield:Q', scale=alt.Scale(zero=False)),
  y=alt.Y('variety:N')
)

points = alt.Chart(source).mark_point(filled=True, color='black').encode(
  x=alt.X('yield:Q', aggregate='mean'),
  y=alt.Y('variety:N'),
)

error_bars + points

In [23]:
source

Unnamed: 0,yield,variety,year,site
0,27.00000,Manchuria,1931,University Farm
1,48.86667,Manchuria,1931,Waseca
2,27.43334,Manchuria,1931,Morris
3,39.93333,Manchuria,1931,Crookston
4,32.96667,Manchuria,1931,Grand Rapids
...,...,...,...,...
115,58.16667,Wisconsin No. 38,1932,Waseca
116,47.16667,Wisconsin No. 38,1932,Morris
117,35.90000,Wisconsin No. 38,1932,Crookston
118,20.66667,Wisconsin No. 38,1932,Grand Rapids


Marks have properties that are set through the arguments of `mark_*()`. More information [here](https://altair-viz.github.io/user_guide/marks.html#mark-properties).

# Channels

Altair provides several encoding channels (both magnitude and identity channels).

## Position channels

In [24]:
# x and y axes
source = data.cars()

alt.Chart(source).mark_circle(size=60).encode(
    x='Horsepower',
    y='Miles_per_Gallon',
    color='Origin'
)

In [26]:
# Ranges (here along the x axis)
source = pd.DataFrame([
    {"task": "A", "start": 1, "end": 3},
    {"task": "B", "start": 3, "end": 8},
    {"task": "C", "start": 8, "end": 10}
])

alt.Chart(source).mark_bar().encode(
    x='start',
    x2='end',
    y='task'
)


In [27]:
# Angle and radius (theta)

source = pd.DataFrame({"values": [12, 23, 47, 6, 52, 19]})

base = alt.Chart(source).encode(
    theta=alt.Theta("values:Q", stack=True),
    radius=alt.Radius("values", scale=alt.Scale(type="sqrt", zero=True, rangeMin=20)),
    color="values:N",
)

c1 = base.mark_arc(innerRadius=20, stroke="#fff")

c2 = base.mark_text(radiusOffset=10).encode(text="values:Q")

c1 + c2

## Channels that apply to the mark

In [30]:
# Color and shape

alt.Chart(cars).mark_point().encode(
    x='Horsepower',
    y='Miles_per_Gallon',
    color='Origin',
    shape='Origin'
)


In [33]:
# Size

source = data.github()

alt.Chart(source).mark_circle().encode(
    x='hours(time):O',
    y='day(time):O',
    size='sum(count):Q',
    color='sum(count):Q'
)

In [32]:
source

Unnamed: 0,time,count
0,2015-01-01 01:00:00,2
1,2015-01-01 04:00:00,3
2,2015-01-01 05:00:00,1
3,2015-01-01 08:00:00,1
4,2015-01-01 09:00:00,3
...,...,...
950,2015-05-29 17:00:00,1
951,2015-05-29 19:00:00,1
952,2015-05-30 00:00:00,10
953,2015-05-30 09:00:00,1


In [35]:
# Colormaps
# https://vega.github.io/vega/docs/schemes/

iris = data.iris()

alt.Chart(iris).mark_point().encode(
    x='petalWidth',
    y='petalLength',
    color=alt.Color('species', scale=alt.Scale(scheme='dark2'))
)

In [37]:
# Custom Colormaps

iris = data.iris()
domain = ['setosa', 'versicolor', 'virginica']
range_ = ['red', '#5fad82', '#edb64e']

alt.Chart(iris).mark_point().encode(
    x='petalWidth',
    y='petalLength',
    color=alt.Color('species', scale=alt.Scale(domain=domain, range=range_))
)

In [40]:
# The colors may already come with the dataset

df = pd.DataFrame({
    'x': range(6),
    'color': ['red', 'steelblue', 'chartreuse', '#F4D03F', '#D35400', '#7D3C98']
})

alt.Chart(df).mark_point(
    filled=True,
    size=100
).encode(
    x='x',
    color=alt.Color('color', scale=None)
)

In [41]:
df

Unnamed: 0,x,color
0,0,red
1,1,steelblue
2,2,chartreuse
3,3,#F4D03F
4,4,#D35400
5,5,#7D3C98


## Other channels

In [42]:
# Text
source = pd.DataFrame({
    'x': [1, 3, 5, 7, 9],
    'y': [1, 3, 5, 7, 9],
    'label': ['A', 'B', 'C', 'D', 'E']
})

points = alt.Chart(source).mark_point().encode(
    x='x:Q',
    y='y:Q'
)

text = points.mark_text(
    align='left',
    baseline='middle',
    dx=7
).encode(
    text='label'
)

points + text


In [44]:
points = alt.Chart(source).mark_point().encode(
    x='x:Q',
    y='y:Q', 
    tooltip='label'
)

points

In [45]:
# Interactivity (Class 7)

alt.Chart(cars).mark_point().encode(
    x='Horsepower',
    y='Miles_per_Gallon',
    color='Origin',
    tooltip='Name'
).interactive()

# Other

In [46]:
# Order

barley = data.barley()

alt.Chart(barley).mark_bar().encode(
    x='variety:N',
    y='sum(yield):Q',
    color='site:N',
    order=alt.Order("site", sort="ascending")
)

In [48]:
# Adjusting axes

alt.Chart(cars).mark_point().encode(
    alt.X('Acceleration:Q',
        scale=alt.Scale(domain=(5, 30))
    ),
    y='Horsepower:Q'
)

In [49]:
# Labels

df = pd.DataFrame({'x': [0.03, 0.04, 0.05, 0.12, 0.07, 0.15],
                   'y': [10, 35, 39, 50, 24, 35]})

alt.Chart(df).mark_circle().encode(
    x=alt.X('x', axis=alt.Axis(format='%', title='percentage')),
    y=alt.Y('y', axis=alt.Axis(format='$', title='dollar amount'))
)

In [50]:
# Legend

iris = data.iris()

alt.Chart(iris).mark_point().encode(
    x='petalWidth',
    y='petalLength',
    color=alt.Color('species', legend=alt.Legend(title="Species by color"))
)

In [52]:
# Grid and borders

iris = data.iris()

alt.Chart(iris).mark_point().encode(
    x='petalWidth',
    y='petalLength',
    color='species'
).configure_axis(
    grid=False
).configure_view(
    strokeWidth=0
)

In [53]:
# Chart size

cars = data.cars()

alt.Chart(cars).mark_bar().encode(
    x='Origin',
    y='count()'
).properties(
    width=200,
    height=150
)

In [None]:
# Multiple views

alt.Chart(cars).mark_bar().encode(
    x='Origin',
    y='count()',
    column='Cylinders:Q'
)