# Advanced Data Patterns with Altair

## Objectives

- Explore advanced data visualization techniques using Altair.
- Demonstrate the creation of boxplots to analyze the distribution and identify outliers in car prices by body style.
- Utilize violin plots to visualize the distribution density of car prices by fuel type.
- Illustrate the creation of heatmaps to examine the frequency of car makes and body styles.

## Background

This notebook leverages Altair to create advanced visual representations of data from the Automobile Dataset. The visualizations focus on statistical techniques such as boxplots, violin plots, and heatmaps to explore car attributes.

## Datasets Used

**Automobile Dataset from UCI**: This dataset contains detailed attributes of automobiles, including their prices, body styles, and fuel types, among other categorical and continuous variables. It is utilized here to demonstrate various statistical visualization techniques in Altair.

## Automobile Dataset

In [1]:
import pandas as pd
pd.set_option('display.max_columns', 8)
import altair as alt

We will use the [Automobile Data Set](https://archive.ics.uci.edu/ml/datasets/automobile) from the [UCI Machine Learning Repository](https://archive-beta.ics.uci.edu/). It includes both categorical and continuous variables.

Defining the headers

In [2]:
# Defining the headers
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration", "num_doors", "body_style", 
        "drive_wheels", "engine_location", "wheel_base", "length", "width", "height", "curb_weight", 
        "engine_type", "num_cylinders", "engine_size", "fuel_system", "bore", "stroke", "compression_ratio", 
        "horsepower", "peak_rpm", "city_mpg", "highway_mpg", "price"]

In [3]:
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data",
                  header=None, names=headers, na_values="?" )
df.head()

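| | symboling | normalized_losses | make | fuel_type | ... | peak_rpm | city_mpg | highway_mpg | price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | NaN | alfa-romero | gas | ... | 5000.0 | 21 | 27 | 13495.0 |
| 1 | 3 | NaN | alfa-romero | gas | ... | 5000.0 | 21 | 27 | 16500.0 |
| 2 | 1 | NaN | alfa-romero | gas | ... | 5000.0 | 19 | 26 | 16500.0 |
| 3 | 2 | 164.0 | audi | gas | ... | 5500.0 | 24 | 30 | 13950.0 |
| 4 | 2 | 164.0 | audi | gas | ... | 5500.0 | 18 | 22 | 17450.0 |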
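The `na_values="?"` argument is what turns the `?` placeholders of the raw file into missing values. The sketch below illustrates this on a small in-memory sample; the three rows and the reduced column list are made up for illustration and are not taken from the dataset.

```python
import io

import pandas as pd

# Hypothetical three-row extract in the same comma-separated, header-less layout.
raw = "alfa-romero,gas,13495\naudi,gas,?\nbmw,diesel,16430\n"

sample = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=["make", "fuel_type", "price"],
    na_values="?",  # treat "?" as a missing value
)

# Count the missing values in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)
print(sample["price"].dtype)
```

Because a column with missing values cannot stay an integer column by default, `price` is read as a float. Running `df.isna().sum()` on the full DataFrame gives the same per-column count for the real data.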


## Boxplots

In [4]:
alt.Chart(df).mark_boxplot().encode(
    x='body_style',
    y='price'
)

The default output is a small chart with the boxes filled in blue.

We can change the size of the chart with the `properties()` method.

In [5]:
alt.Chart(df).mark_boxplot().encode(
    x='body_style',
    y='price'
).properties(width=500)

Updating the box sizes

In [6]:
# Updating the box sizes
alt.Chart(df).mark_boxplot(size=60).encode(
    x='body_style',
    y='price'
).properties(width=500)

Adding a title

In [7]:
# Adding a title
alt.Chart(df).mark_boxplot(size=60).encode(
    x='body_style',
    y='price'
).properties(
    width=500,
    title='Box Plot: Car Price by Body style'
)

Format all titles

In [8]:
# Format all titles
alt.Chart(df).mark_boxplot(size=60).encode(
    alt.X('body_style', title='Body style'),
    alt.Y('price', title='Car Price'),    
).properties(
    width=500,
    title={"text":"Box Plot: Car Price by Body style", 
            "fontSize":20,
            "color": "steelblue"}
)

Rotating the axis labels

In [9]:
# Rotating the axis labels
alt.Chart(df).mark_boxplot(size=60).encode(
    alt.X('body_style', title='Body style', axis=alt.Axis(labelAngle=0)),
    alt.Y('price', title='Car Price'),    
).properties(
    width=500,
    title={"text":"Box Plot: Car Price by Body style", 
            "fontSize":20,
            "color": "steelblue"}
)

Increasing the size of axis titles

In [10]:
# Increasing the size of axis titles
alt.Chart(df).mark_boxplot(size=60).encode(
    alt.X('body_style', title='Body style', axis=alt.Axis(labelAngle=0)),
    alt.Y('price', title='Car Price'),    
).properties(
    width=500,
    title={"text":"Box Plot: Car Price by Body style", 
            "fontSize":20,
            "color": "steelblue"}
).configure_axis(
    titleFontSize=16
)

Increasing the size of axis labels

In [11]:
# Increasing the size of axis labels
alt.Chart(df).mark_boxplot(size=60).encode(
    alt.X('body_style', title='Body style', axis=alt.Axis(labelAngle=0)),
    alt.Y('price', title='Car Price'),    
).properties(
    width=500,
    title={"text":"Box Plot: Car Price by Body style", 
            "fontSize":20,
            "color": "steelblue"}
).configure_axis(
    titleFontSize=16,
    labelFontSize=13
)

Removing the grid lines

In [12]:
alt.Chart(df).mark_boxplot(size=60).encode(
    alt.X('body_style', title='Body style', axis=alt.Axis(labelAngle=0)),
    alt.Y('price', title='Car Price'),    
).properties(
    width=500,
    title={"text":"Box Plot: Car Price by Body style", 
            "fontSize":20,
            "color": "steelblue"}
).configure_axis(
    titleFontSize=16,
    labelFontSize=13,
    grid=False
)

Coloring the boxes

In [13]:
# Coloring the boxes
alt.Chart(df).mark_boxplot(size=60).encode(
    alt.X('body_style', title='Body style', axis=alt.Axis(labelAngle=0)),
    alt.Y('price', title='Car Price'),
    color=alt.Color('body_style', title='Body style')
).properties(
    width=500,
    title={"text":"Box Plot: Car Price by Body style", 
            "fontSize":20,
            "color": "steelblue"}
).configure_axis(
    titleFontSize=16,
    labelFontSize=13,
    grid=False
)

Boxplots can also show outliers. With the default setting (`extent=1.5`), points below `(Q1 - 1.5 * IQR)` or above `(Q3 + 1.5 * IQR)` are drawn as outliers, where `IQR = Q3 - Q1`.
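To make the rule concrete, here is a minimal sketch that computes the two fences and the resulting outliers for several multipliers. The price list is hypothetical, the helper names `outlier_fences` and `find_outliers` are illustrative, and the quartiles use the standard library's "inclusive" interpolation, which may differ slightly from the quartile method the chart uses internally.

```python
import statistics


def outlier_fences(values, extent=1.5):
    """Return (lower, upper) fences: Q1 - extent*IQR and Q3 + extent*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - extent * iqr, q3 + extent * iqr


def find_outliers(values, extent=1.5):
    """Return the values that fall outside the fences."""
    low, high = outlier_fences(values, extent)
    return [v for v in values if v < low or v > high]


# Hypothetical prices, not taken from the dataset.
prices = [3000, 6000, 7000, 8000, 9000, 10000, 11000, 30000]

for extent in (1.5, 1.0, 0.5):
    print(extent, outlier_fences(prices, extent), find_outliers(prices, extent))
```

A smaller multiplier narrows the fences, so more points are classified as outliers: here only 30000 is flagged at 1.5, while 3000 is flagged as well at 1.0 and 0.5.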

Showing outliers below Q1-1.5*IQR or above Q3+1.5*IQR (`extent=1.5`)

In [14]:
# Showing outliers below Q1-1.5*IQR or above Q3+1.5*IQR
alt.Chart(df).mark_boxplot(size=60, extent=1.5).encode(
    alt.X('body_style', title='Body style', axis=alt.Axis(labelAngle=0)),
    alt.Y('price', title='Car Price'),
    color=alt.Color('body_style', title='Body style')
).properties(
    width=500,
    title={"text":["Box Plot: Car Price by Body style","(circles represent outliers)"], 
            "fontSize":20,
            "color": "steelblue"}
).configure_axis(
    titleFontSize=16,
    labelFontSize=13,
    grid=False
)

Showing outliers below Q1-IQR or above Q3+IQR (`extent=1`)

In [15]:
# Showing outliers below Q1-IQR or above Q3+IQR
alt.Chart(df).mark_boxplot(size=60, extent=1).encode(
    alt.X('body_style', title='Body style', axis=alt.Axis(labelAngle=0)),
    alt.Y('price', title='Price'),
    color=alt.Color('body_style', title='Body style')
).properties(
    width=500,
    title={"text":["Box Plot: Price by Body style","(circles represent outliers)"], 
            "fontSize":20,
            "color": "steelblue"}
).configure_axis(
    titleFontSize=16,
    labelFontSize=13,
    grid=False
)

Showing outliers below Q1-0.5*IQR or above Q3+0.5*IQR (`extent=0.5`)

In [16]:
# Showing outliers below Q1-0.5*IQR or above Q3+0.5*IQR
alt.Chart(df).mark_boxplot(size=60, extent=0.5).encode(
    alt.X('body_style', title='Body style', axis=alt.Axis(labelAngle=0)),
    alt.Y('price', title='Price'),
    color=alt.Color('body_style', title='Body style')
).properties(
    width=500,
    title={"text":["Box Plot: Price by Body style","(circles represent outliers)"], 
            "fontSize":20,
            "color": "steelblue"}
).configure_axis(
    titleFontSize=16,
    labelFontSize=13,
    grid=False
)

## Violin Plots

Creating a simple violin plot of `price`

In [17]:
alt.Chart(df).transform_density(
    'price',
    as_=['price', 'density'],    
).mark_area().encode(
    alt.X('price:Q'),    
    alt.Y('density:Q', stack='center')    
)    
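`transform_density` replaces the raw prices with a smoothed density curve. As a rough illustration of what such a curve represents, the sketch below evaluates a Gaussian kernel density estimate by hand. The prices, the bandwidth, and the grid are hypothetical choices made for this example; the chart's transform picks its own bandwidth and evaluation points, so its numbers will differ.

```python
import math


def gaussian_kde(samples, bandwidth, grid):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        for x in grid
    ]


# Hypothetical prices (not from the dataset) and a hand-picked bandwidth.
prices = [6000, 7000, 7500, 8000, 9000, 16000, 17000]
bandwidth = 1500.0
step = 250.0
grid = [i * step for i in range(121)]  # 0, 250, ..., 30000

density = gaussian_kde(prices, bandwidth, grid)

# The curve should integrate to roughly 1 and peak where the prices cluster.
area = sum(density) * step
peak_price = grid[density.index(max(density))]
print(round(area, 3), peak_price)
```

Each observation contributes one bell-shaped bump, and the density is the average of those bumps. Centering the stacked area with `stack='center'` mirrors this curve around the axis, which produces the violin shape.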
    

Swapping the roles of `x` and `y`, and adding `orient='horizontal'`

In [18]:
alt.Chart(df).transform_density(
    'price',
    as_=['price', 'density'],    
).mark_area(orient='horizontal').encode(
    alt.Y('price:Q'),    
    alt.X('density:Q', stack='center')   
)  

Styling axis titles

In [19]:
alt.Chart(df).transform_density(
    'price',
    as_=['price', 'density'],    
).mark_area(orient='horizontal').encode(
    alt.Y('price:Q', title='Car Price'),    
    alt.X('density:Q', stack='center')   
).configure_axis(
    titleFontSize=16,
    labelFontSize=13,
)  

Grouping by `fuel_type`

In [20]:
alt.Chart(df).transform_density(
    'price',
    as_=['price', 'density'],  
    groupby=['fuel_type']   
).mark_area(orient='horizontal').encode(
    alt.Y('price:Q', title='Car Price'),    
    alt.X('density:Q', stack='center'),   
    alt.Color('fuel_type:N', title='Fuel type')
).configure_axis(
    titleFontSize=16,
    labelFontSize=13,
)  

Let's separate the violins into two columns.

In [21]:
alt.Chart(df).transform_density(
    'price',
    as_=['price', 'density'],  
    groupby=['fuel_type']   
).mark_area(orient='horizontal').encode(
    alt.Y('price:Q', title='Car Price'),    
    alt.X('density:Q', stack='center'),   
    alt.Color('fuel_type:N', title='Fuel type'),
    alt.Column('fuel_type')
).configure_axis(
    titleFontSize=16,
    labelFontSize=13,
)  

We can also separate them into two rows.

In [22]:
alt.Chart(df).transform_density(
    'price',
    as_=['price', 'density'],  
    groupby=['fuel_type']   
).mark_area(orient='horizontal').encode(
    alt.Y('price:Q', title='Car Price'),    
    alt.X('density:Q', stack='center'),   
    alt.Color('fuel_type:N', title='Fuel type'),
    alt.Row('fuel_type')
).configure_axis(
    titleFontSize=16,
    labelFontSize=13,
)

Adding `impute=None`

In [23]:
alt.Chart(df).transform_density(
    'price',
    as_=['price', 'density'],  
    groupby=['fuel_type']   
).mark_area(orient='horizontal').encode(
    alt.Y('price:Q', title='Car Price'),    
    alt.X('density:Q', stack='center', impute=None),   
    alt.Color('fuel_type:N', title='Fuel type'),
    alt.Row('fuel_type')
).configure_axis(
    titleFontSize=16,
    labelFontSize=13,
)

## Strip Plot

This example shows how price is distributed within each fuel type, drawing one tick mark per record.

In [24]:
alt.Chart(df).mark_tick().encode(
    x='price:Q',
    y='fuel_type:N'
).properties(
    title='Strip Plot of Price by Fuel type'
)

In [25]:
alt.Chart(df).mark_tick().encode(
    x='price:Q',
    y='body_style:N'
).properties(
    width = 600,
    height = 200,
    title='Strip Plot of Price by Body style'
)

## Heatmaps

A heatmap is a data visualization method showing a phenomenon's magnitude as color in two dimensions. The color variation gives visual cues about how the phenomenon varies.

In [26]:
alt.Chart(df).mark_rect().encode(
    alt.X('make:N', title="Make"),
    alt.Y('body_style:N', title="Body Style"),
    color='count()',
).properties(
    width=600,
    height=150,
    title= {'text':'Make vs. Body Style',
            'fontSize':20}
)

Configuring common options

In [27]:
# Configuring common options
base = alt.Chart(df).transform_aggregate(
    num_records='count()',
    groupby=['make', 'body_style']
).encode(
    alt.X('make:N', title='Make'),
    alt.Y('body_style:N', title='Body style'),
).properties(
    width=600,
    height=150,
    title= {'text':'Make vs. Body Style',
            'fontSize':20}
)
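The `transform_aggregate` step counts the records for each combination of `make` and `body_style` and stores the result in `num_records`. A comparable table can be built in pandas, as in this sketch; the six rows are made up for illustration.

```python
import pandas as pd

# Made-up sample with the two columns used by the heatmap.
sample = pd.DataFrame({
    "make": ["audi", "audi", "audi", "bmw", "bmw", "volvo"],
    "body_style": ["sedan", "sedan", "wagon", "sedan", "sedan", "wagon"],
})

# Count the rows in each (make, body_style) group.
counts = (
    sample.groupby(["make", "body_style"])
    .size()
    .reset_index(name="num_records")
)
print(counts)
```

Combinations that never occur, such as a `bmw` wagon in this sample, produce no row at all, which is why the corresponding heatmap cells stay empty rather than showing a zero.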

Configuring the heatmap

In [28]:
# Configuring the heatmap
map = base.mark_rect().encode(
    color=alt.Color('num_records:Q')
)
map

Configuring text

In [29]:
# Configuring text
text = base.mark_text(baseline='middle').encode(
    text='num_records:Q',
    color=alt.condition(
        alt.datum.num_records < 6,
        alt.value('black'),
        alt.value('white')
    )
)
text

In [30]:
map + text

Changing the color scheme

In [31]:
# Changing the color scheme
map = base.mark_rect().encode(
    color = alt.Color('num_records:Q', 
        scale = alt.Scale(scheme='reds'))
)

In [32]:
map + text

## Conclusions

- Boxplots effectively reveal car prices' central tendencies and dispersion across different body styles, highlighting outliers and quartile distributions.
- Violin plots offer a deeper understanding of price distributions by fuel type, showing the density and spread of data in a more detailed manner than traditional boxplots.
- Heatmaps are helpful for visualizing the concentration of categorical interactions, such as the frequency of car makes per body style. They provide a color-coded intensity map that is easy to interpret.
- Altair's declarative API expresses transformations such as density estimation (`transform_density`) and aggregation (`transform_aggregate`) directly in the chart definition, so statistical summaries and their visualizations are built in the same step.

## References

- https://altair-viz.github.io/