[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nekrut/bda/blob/colab/lectures/lecture8.ipynb)

# Lecture 8: Introduction to Altair

This notebook introduces **Altair**, a Python library for creating statistical visualizations. We start with the basics and progressively build toward analyzing real-world genomic metadata.

By the end, you will be able to:
- Create basic charts (scatter, bar, line)
- Encode data fields to visual properties
- Aggregate and transform data within charts
- Customize colors, scales, and labels
- Layer multiple chart elements
- Build publication-quality heatmaps

## Part 1: What is Declarative Visualization?

Think of ordering food at a restaurant. You don't walk into the kitchen and say "heat the pan to 375°F, dice the onions, sauté for 3 minutes..."; you just say "I'd like the pasta." That's the difference between **imperative** and **declarative**.

**Imperative** (matplotlib): you specify *how* to draw, step by step. You manage coordinates, colors, labels, legends, and layout yourself:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 3))
cities = ['Seattle', 'New York', 'Chicago']
temps = [53.7, 54.0, 48.7]  # six-month averages from the weather table, computed by hand
colors = ['#4c78a8', '#f58518', '#e45756']
bars = ax.barh(cities, temps, color=colors)
ax.set_xlabel('Average Temperature (°F)')
ax.set_title('Average Temperature by City')
ax.bar_label(bars, fmt='%.1f')
ax.set_xlim(0, 65)
plt.tight_layout()
plt.show()
```

**Declarative** (Altair): you describe *what* you want to see. You state the relationships between your data and visual properties. Altair handles scales, axes, labels, and layout automatically:

```python
# `alt` and `weather` are defined in Part 2 below
alt.Chart(weather).mark_bar().encode(
    x='average(temp):Q',
    y='city:N',
    color='city:N'
)
```

The key difference: with matplotlib you compute the averages yourself, position each bar, pick colors, format labels, and manage layout. With Altair you declare "show average temperature by city, color by city" and the library does the rest, including aggregation, axis scaling, and a legend.

## Part 2: Setup

In [1]:
import pandas as pd
import altair as alt

Let's create a simple dataset to work with. It holds monthly precipitation and temperature for three cities:

In [2]:
weather = pd.DataFrame({
    'city': ['Seattle', 'Seattle', 'Seattle', 'Seattle', 'Seattle', 'Seattle',
             'New York', 'New York', 'New York', 'New York', 'New York', 'New York',
             'Chicago', 'Chicago', 'Chicago', 'Chicago', 'Chicago', 'Chicago'],
    'month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
              'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
              'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'],
    'precip': [5.2, 3.9, 4.1, 2.8, 2.1, 1.6,
               3.6, 3.1, 4.2, 4.0, 4.5, 4.2,
               2.0, 1.9, 2.6, 3.7, 4.1, 4.0],
    'temp': [42, 45, 50, 55, 62, 68,
             35, 38, 48, 58, 68, 77,
             28, 32, 42, 52, 64, 74]
})

weather

Unnamed: 0,city,month,precip,temp
0,Seattle,Jan,5.2,42
1,Seattle,Feb,3.9,45
2,Seattle,Mar,4.1,50
3,Seattle,Apr,2.8,55
4,Seattle,May,2.1,62
5,Seattle,Jun,1.6,68
6,New York,Jan,3.6,35
7,New York,Feb,3.1,38
8,New York,Mar,4.2,48
9,New York,Apr,4.0,58


This is **tidy data**: each row is one observation, each column is one variable. Altair works best with tidy data.
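
Data often arrives in a "wide" layout instead, with one column per city. As a minimal sketch (the `wide` frame below is a hypothetical reshaping of the first three months of precipitation, not part of the lecture's dataset), `melt` converts such a table into the tidy form described above:

```python
import pandas as pd

# Hypothetical wide layout: one row per month, one column per city
wide = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Mar'],
    'Seattle': [5.2, 3.9, 4.1],
    'New York': [3.6, 3.1, 4.2],
    'Chicago': [2.0, 1.9, 2.6],
})

# Melt into tidy form: one row per (month, city) observation
tidy = wide.melt(id_vars='month', var_name='city', value_name='precip')
print(tidy.head())
```

After melting, every row holds a single measurement, so `city` and `precip` can each be mapped to a visual channel.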

## Part 3: Your First Chart

### The Three Building Blocks

Every Altair chart has three components:

1. **Data**: a pandas DataFrame
2. **Mark**: the visual shape (point, bar, line, etc.)
3. **Encoding**: which data fields map to which visual properties

### Creating a Chart Object

Start by wrapping your DataFrame in `alt.Chart()`:

In [3]:
# This creates a chart object - it stores data but can't render without a mark
chart = alt.Chart(weather)
print(type(chart))  # It's an Altair Chart object

<class 'altair.vegalite.v5.api.Chart'>


The chart object exists but can't display anything yet: Altair requires a **mark** to render. Let's add one.

### Adding a Mark

In [4]:
# mark_point() draws circles
alt.Chart(weather).mark_point()

We see points, but they're all stacked on top of each other. We need **encodings** to spread them out.

### Adding Encodings

Encodings map data columns to visual channels like position (x, y), color, size, etc.

In [5]:
alt.Chart(weather).mark_point().encode(
    x='precip',
    y='city'
)

Now each point is positioned:
- Horizontally by precipitation value
- Vertically by city name

Notice how Altair automatically:
- Created axis labels from column names
- Scaled the x-axis to fit the data
- Separated cities on the y-axis

### Different [Mark Types](https://altair-viz.github.io/user_guide/marks/index.html)

Altair provides many mark types. Here are the most common:

In [6]:
# Bar chart
alt.Chart(weather).mark_bar().encode(
    x='precip',
    y='city'
)

In [13]:
# Tick chart (one tick per observation)
alt.Chart(weather).mark_tick().encode(
    x='precip',
    y='city'
)

In [7]:
# Line chart
alt.Chart(weather).mark_line().encode(
    x='month',
    y='precip'
)

The line chart connects all points into a single line, regardless of city. We'll learn how to separate the cities later using color encoding.

> **📝 Exercise 1:** Create a scatter plot with `temp` on the x-axis and `precip` on the y-axis.

In [17]:
# One possible solution to Exercise 1
alt.Chart(weather).mark_point().encode(
    x='temp',
    y='precip'
)

## Part 4: Data Types

Altair needs to know the **type** of each data field to choose appropriate scales and displays:

| Type | Code | Description | Example |
|------|------|-------------|--------|
| Quantitative | `:Q` | Numerical values | Temperature, price |
| Nominal | `:N` | Categories (no order) | City names, colors |
| Ordinal | `:O` | Ordered categories | Small/Medium/Large |
| Temporal | `:T` | Date/time | 2024-01-15 |

You specify types by adding them after the field name with a colon:

In [8]:
# Explicit type annotations
alt.Chart(weather).mark_bar().encode(
    x='precip:Q',  # Quantitative
    y='city:N'     # Nominal
)

Altair usually guesses correctly, but explicit types prevent surprises.

### Controlling Sort Order

By default, Altair sorts categorical axis values alphabetically. To get chronological order for month names, you must specify the sort order explicitly:

In [24]:
# Default: sorted alphabetically (Apr, Feb, Jan, Jun, Mar, May)
alt.Chart(weather).mark_bar().encode(
    x='month:O',
    y='average(precip):Q'
)

In [15]:
# Explicit sort: chronological order
alt.Chart(weather).mark_bar().encode(
    x=alt.X('month:O', sort=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']),
    y='average(precip):Q'
)

> **📝 Exercise 2:** Create a bar chart showing average temperature per month with proper chronological order.

## Part 5: Aggregation

In Altair, you can aggregate directly in the encoding string:

In [25]:
alt.Chart(weather).mark_bar().encode(
    x='average(precip):Q',  # Average of precip column
    y='city:N'
)

Altair automatically grouped by city and calculated the average for each.
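
For comparison, the same grouping and averaging can be written by hand in pandas. This sketch rebuilds only the `city` and `precip` columns so that it runs on its own; `weather_demo` and `avg_precip` are names chosen for illustration:

```python
import pandas as pd

# Same precipitation values as the weather table, rebuilt compactly
weather_demo = pd.DataFrame({
    'city': ['Seattle'] * 6 + ['New York'] * 6 + ['Chicago'] * 6,
    'precip': [5.2, 3.9, 4.1, 2.8, 2.1, 1.6,
               3.6, 3.1, 4.2, 4.0, 4.5, 4.2,
               2.0, 1.9, 2.6, 3.7, 4.1, 4.0],
})

# Group by city, then average precip within each group
avg_precip = weather_demo.groupby('city')['precip'].mean()
print(avg_precip.round(2))
```

The bar lengths in the Altair chart should match these per-city means.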

### Available Aggregation Functions

- `count()`: number of rows
- `sum(field)`: total
- `average(field)` or `mean(field)`: average
- `median(field)`: median
- `min(field)` / `max(field)`: extremes
- `stdev(field)`: standard deviation

In [26]:
# Count observations per city
alt.Chart(weather).mark_bar().encode(
    x='count():Q',
    y='city:N'
)

In [29]:
# Max temperature per city
alt.Chart(weather).mark_circle().encode(
    x='max(temp):Q',
    y='city:N'
)

> **📝 Exercise 3:** Create a bar chart showing **total** precipitation per month (across all cities).

## Part 6: Color Encoding

In [30]:
alt.Chart(weather).mark_line().encode(
    x='month:O',
    y='precip:Q',
    color='city:N'  # Different color for each city
)

Each city now has its own line with a distinct color. Altair added a legend automatically.

### Color for Quantitative Data

You can also map numeric values to color intensity. See [Vega Color Schemes](https://vega.github.io/vega/docs/schemes/) for all available palettes.

In [31]:
alt.Chart(weather).mark_circle(size=100).encode(
    x='month:O',
    y='city:N',
    color='precip:Q'  # Color intensity shows precipitation
)

In [37]:
# Try different color schemes from Vega, e.g. 'plasma', 'inferno', 'magma', 'turbo',
# 'blues', 'greens', 'oranges', 'reds', 'purples', 'goldred', 'redyellowblue'
alt.Chart(weather).mark_circle(size=100).encode(
    x='month:O',
    y='city:N',
    color=alt.Color('precip:Q', scale=alt.Scale(scheme='redyellowblue'))
)

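With the default sequential scheme, darker colors indicate higher precipitation; with a diverging scheme such as `'redyellowblue'`, low and high values sit at opposite ends of the palette. This is the foundation of a heatmap!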

> **📝 Exercise 4:** Create a scatter plot of `temp` vs `precip` with color encoding for `city`.

In [40]:
alt.Chart(weather).mark_point().encode(
    x='temp',
    y='precip',
    color='city:N'
)

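## Part 7: Customization

Marks accept fixed properties such as color, size, and opacity as arguments:

In [43]: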
alt.Chart(weather).mark_point(
    color='firebrick',  # Fixed color for all points
    size=50,           # Fixed size
    opacity=0.7         # Transparency
).encode(
    x='temp:Q',
    y='precip:Q'
)

In [46]:
alt.Chart(weather).mark_bar(color='lightblue').encode(
    x='average(precip):Q',
    y='city:N'
).properties(
    width=400,
    height=150,
    title='Average Precipitation by City'
)

### Axis and Scale Customization

For more control, use `alt.X()` and `alt.Y()` objects instead of strings:

In [48]:
alt.Chart(weather).mark_bar(color='lightblue').encode(
    x=alt.X(
        'average(precip):Q',
        title='Average Precipitation (inches)',  # Custom axis title
        scale=alt.Scale(domain=[0, 5])           # Fixed axis range
    ),
    y=alt.Y(
        'city:N',
        title='City',
        axis=alt.Axis(labelFontSize=12)          # Larger labels
    )
).properties(
    width=400,
    height=150
)

### Color Schemes

Altair includes many built-in color schemes:

In [49]:
alt.Chart(weather).mark_circle(size=200).encode(
    x='month:O',
    y='city:N',
    color=alt.Color(
        'precip:Q',
        scale=alt.Scale(scheme='blues')  # Blue color gradient
    )
).properties(width=300, height=150)

Popular schemes: `'blues'`, `'greens'`, `'oranges'`, `'viridis'`, `'goldred'`, `'redyellowblue'`

> **📝 Exercise 5:** Create a bar chart of average temperature per city with orange bars and a title.

In [56]:
alt.Chart(weather).mark_bar().encode(
    x=alt.X('average(temp):Q', title='Average Temperature (°F)'),
    y='city:N',
    color=alt.value('orange') # Sets all bars to orange
).properties(
    width=300,
    height=150,
    title='Average Temperature by City'
)

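Charts can be layered with the `+` operator. Here a text layer showing each precipitation value is drawn on top of a heatmap, and the text color switches at a threshold so that it stays readable:

In [52]: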
# Heatmap base
heatmap = alt.Chart(weather).mark_rect().encode(
    x='month:O',
    y='city:N',
    color=alt.Color('precip:Q', scale=alt.Scale(scheme='goldred'))
)

# Text with conditional color
text = alt.Chart(weather).mark_text(
    fontSize=12,
    fontWeight='bold'
).encode(
    x='month:O',
    y='city:N',
    text=alt.Text('precip:Q', format='.1f'),
    color=alt.condition(
        alt.datum.precip > 3.5,    # If precip > 3.5
        alt.value('white'),         # Use white text
        alt.value('black')          # Otherwise black
    )
)

(heatmap + text).properties(width=300, height=150)

Now high values have white text (readable on dark red) and low values have black text.
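
The cutoff of 3.5 is hard-coded. One way to derive a similar value from the data, assuming the midpoint of the data range is a reasonable switch point for this palette (an assumption for illustration, not a rule), is sketched below in plain Python. The list comprehension mirrors what `alt.condition` does for each cell:

```python
# Precipitation values from the weather table
precip = [5.2, 3.9, 4.1, 2.8, 2.1, 1.6,
          3.6, 3.1, 4.2, 4.0, 4.5, 4.2,
          2.0, 1.9, 2.6, 3.7, 4.1, 4.0]

# Midpoint of the data range as a candidate cutoff
threshold = (min(precip) + max(precip)) / 2

# Mirror of alt.condition(alt.datum.precip > threshold, 'white', 'black')
text_colors = ['white' if p > threshold else 'black' for p in precip]
print(round(threshold, 2), text_colors.count('white'), text_colors.count('black'))
```

For this dataset the midpoint lands close to the hand-picked 3.5, and no value falls between the two, so both cutoffs color the labels identically.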

In [57]:
# Data with huge range
wide_range = pd.DataFrame({
    'category': ['A', 'B', 'C', 'D'],
    'value': [10, 100, 1000, 50000]
})

# Linear scale - small values barely visible
alt.Chart(wide_range).mark_bar().encode(
    x='category:N',
    y='value:Q'
).properties(title='Linear Scale')

> **📝 Exercise 6:** Create a bar chart of average temperature per city with text labels on the bars.

In [59]:
bars = alt.Chart(weather).mark_bar(color='gray').encode(
    x=alt.X('average(temp):Q', title='Average Temperature (°F)'),
    y='city:N'
)

text = alt.Chart(weather).mark_text(align='left', baseline='middle', dx=3, color='black').encode(
    x=alt.X('average(temp):Q', stack='zero'),
    y='city:N',
    text=alt.Text('average(temp):Q', format='.1f')
)

(bars + text).properties(
    width=400,
    height=150,
    title='Average Temperature by City with Labels'
)

In [66]:
# Log color scale - shows variation across all magnitudes
alt.Chart(wide_range).mark_rect().encode(
    x='category:N',
    y=alt.value(1),  # Single row
    color=alt.Color(
        'value:Q',
        scale=alt.Scale(scheme='goldred', type='log')
    )
).properties(width=300, height=50, title='Log Color Scale')
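
To see why the log scale helps, it can be useful to compute where each value falls between the minimum and maximum of the scale. The sketch below does this arithmetic in plain Python for the `wide_range` values. It approximates how a continuous scale maps its domain onto a range from 0 to 1 and is not Altair's actual implementation:

```python
import math

values = [10, 100, 1000, 50000]
lo, hi = min(values), max(values)

# Linear scale: position is proportional to the raw value
linear_pos = [(v - lo) / (hi - lo) for v in values]

# Log scale: position is proportional to the base-10 logarithm
log_span = math.log10(hi) - math.log10(lo)
log_pos = [(math.log10(v) - math.log10(lo)) / log_span for v in values]

for v, lin, lg in zip(values, linear_pos, log_pos):
    print(f'{v:>6}: linear={lin:.4f}  log={lg:.4f}')
```

On the linear scale the first three values all land within 2% of the bottom of the range, whereas on the log scale they are spread across more than half of it.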

## Part 8: Real-World Example (SRA Metadata)

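The [Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra) (SRA) is the largest public repository of sequencing data. Here we analyze SARS-CoV-2 metadata to understand how sequencing platforms and library strategies were used during the pandemic. The snapshot loaded below (`ena.tsv.gz`) comes from the European Nucleotide Archive (ENA), which exchanges records with SRA.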

In [67]:
# Load SRA metadata snapshot from Zenodo (first 100k records for speed)
sra = pd.read_csv(
    "https://zenodo.org/records/10680776/files/ena.tsv.gz",
    compression='gzip',
    sep="\t",
    low_memory=False,
    nrows=100000
)

sra.sample(3)

Unnamed: 0,study_accession,base_count,accession,collection_date,country,culture_collection,description,sample_collection,sample_title,sequencing_method,...,library_name,library_construction_protocol,library_layout,instrument_model,instrument_platform,isolation_source,isolate,investigation_type,collection_date_submitted,center_name
87881,PRJNA884724,31166137.0,SAMN31168055,2022-09-19,USA: Pennsylvania,,Illumina MiSeq sequencing,,SARS-CoV-2 sequencing for surveillance in Phil...,,...,ARTIC Network Protocol V4,,PAIRED,Illumina MiSeq,ILLUMINA,nasal swab,SARS-CoV-2/Human/USA/PHL2-B-B11-20221004/2022,,2022-09-19,Philadelphia Public Health Laboratory
98264,PRJNA716984,16047589.0,SAMN29860536,2022-07-10,USA: Rhode Island,,Sequel II sequencing,,CDC Sars CoV2 Sequencing Baseline Constellation,,...,Unknown,Freed primers,PAIRED,Sequel II,PACBIO_SMRT,Nasal Swabs,SARS-CoV-2/Human/USA/RI-CDC-LC0770316/2022,,2022-07-10,
18530,PRJNA731148,284377497.0,SAMN23294533,2021-10-31,USA: Massachusetts,,Illumina NovaSeq 6000 sequencing,,CDC Sars CoV2 Sequencing Baseline Constellation,,...,TaqPath COVID-19 Combo Kit,Illumina COVIDSeq Test v03,PAIRED,Illumina NovaSeq 6000,ILLUMINA,Nasal - Anterior Nares,SARS-CoV-2/Human/USA/MA-CDC-ASC210449955/2021,,2021-10-31,


> ⚠️ **Data Quality:** The metadata is only as good as the person who entered it. Always validate date ranges!
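
As a minimal sketch of such a check, the snippet below parses a handful of made-up date strings (both the values and the plausible window are assumptions for illustration) and flags entries that are unparseable or fall outside the window. The same pattern could be applied to the `collection_date` column of `sra`:

```python
import pandas as pd

# Made-up examples of the kinds of values a free-text date field can contain
dates = pd.Series(['2021-10-31', '2022-07-10', '1900-01-01', 'not collected', '2022-09-19'])

# Unparseable strings become NaT instead of raising an error
parsed = pd.to_datetime(dates, errors='coerce')

# Assumed plausible window for SARS-CoV-2 samples; adjust to the snapshot being analyzed
start, end = pd.Timestamp('2019-12-01'), pd.Timestamp('2024-01-01')
valid = parsed.between(start, end)

print(f'{valid.sum()} valid, {(~valid).sum()} flagged')
print(dates[~valid].tolist())
```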

### Aggregate for Visualization

In [68]:
# Group by platform and library strategy, count unique runs
heatmap_data = sra.groupby(
    ['instrument_platform', 'library_strategy']
).agg(
    {'run_accession': 'nunique'}
).reset_index()

heatmap_data

Unnamed: 0,instrument_platform,library_strategy,run_accession
0,BGISEQ,AMPLICON,1
1,BGISEQ,OTHER,13
2,BGISEQ,RNA-Seq,2
3,BGISEQ,Targeted-Capture,2
4,DNBSEQ,AMPLICON,3
5,ILLUMINA,AMPLICON,78448
6,ILLUMINA,OTHER,3
7,ILLUMINA,RNA-Seq,554
8,ILLUMINA,Targeted-Capture,273
9,ILLUMINA,WCS,2


### Create the Heatmap

In [69]:
# Basic heatmap
alt.Chart(heatmap_data).mark_rect().encode(
    x='instrument_platform:N',
    y='library_strategy:N',
    color='run_accession:Q'
)

### Final Polished Heatmap

In [70]:
# Background: colored rectangles
background = alt.Chart(heatmap_data).mark_rect(opacity=1).encode(
    x=alt.X(
        'instrument_platform:N',
        title='Sequencing Platform'
    ),
    y=alt.Y(
        'library_strategy:N',
        title='Library Strategy',
        axis=alt.Axis(orient='right')
    ),
    color=alt.Color(
        'run_accession:Q',
        title='# Runs',
        scale=alt.Scale(
            scheme='goldred',
            type='log'  # Log scale for color!
        )
    ),
    tooltip=[
        alt.Tooltip('instrument_platform:N', title='Platform'),
        alt.Tooltip('library_strategy:N', title='Strategy'),
        alt.Tooltip('run_accession:Q', title='Number of runs', format=',')
    ]
).properties(
    width=500,
    height=200,
    title={
        'text': 'SARS-CoV-2 Sequencing in ENA',
        'subtitle': 'By Platform and Library Strategy (100k sample)'
    }
)

background

### Add Text Labels

In [71]:
# Text layer with conditional coloring
text_labels = background.mark_text(
    align='center',
    baseline='middle',
    fontSize=11,
    fontWeight='bold'
).encode(
    text=alt.Text('run_accession:Q', format=','),  # Comma-formatted numbers
    color=alt.condition(
        alt.datum.run_accession > 200,  # If value > 200
        alt.value('white'),              # White text (on dark background)
        alt.value('black')               # Black text (on light background)
    )
)

# Combine layers
background + text_labels

This visualization reveals:
- **ILLUMINA + AMPLICON** dominates (78k+ runs): Illumina short reads with PCR amplification
- **PACBIO_SMRT** also relies heavily on the AMPLICON strategy
- **RNA-Seq** is relatively rare compared to AMPLICON
- Some platform/strategy combinations have very few runs

> **📝 Exercise 7:** Create a bar chart showing the top 5 countries by number of SRA submissions.

In [75]:
# Consolidate all USA states into a single 'USA' category
sra['country_cleaned'] = sra['country'].apply(lambda x: 'USA' if isinstance(x, str) and x.startswith('USA:') else x)

# Calculate top 5 countries by SRA submissions using the cleaned country data
top_countries = sra['country_cleaned'].value_counts().head(5).reset_index()
top_countries.columns = ['country', 'submission_count']

# Create the bar chart
alt.Chart(top_countries).mark_bar().encode(
    x=alt.X('submission_count:Q', title='Number of SRA Submissions'),
    y=alt.Y('country:N', sort='-x', title='Country'),
    tooltip=['country', 'submission_count']
).properties(
    title='Top 5 Countries by SRA Submissions'
)

## Summary

| Concept | Syntax |
|---------|--------|
| Create chart | `alt.Chart(df)` |
| Add marks | `.mark_point()`, `.mark_bar()`, `.mark_line()`, `.mark_rect()` |
| Encode data | `.encode(x='col', y='col')` |
| Data types | `:Q` (quantitative), `:N` (nominal), `:O` (ordinal), `:T` (temporal) |
| Aggregation | `'average(col):Q'`, `'sum(col):Q'`, `'count():Q'` |
| Color encoding | `color='col:N'` or `color=alt.Color('col:Q', scale=alt.Scale(scheme='blues'))` |
| Customization | `alt.X('col', title='Label', scale=alt.Scale(...))` |
| Properties | `.properties(width=400, height=200, title='Title')` |
| Layer charts | `chart1 + chart2` |
| Conditional | `alt.condition(predicate, if_true, if_false)` |
| Log scale | `scale=alt.Scale(type='log')` |

## Further Resources

- [Altair Documentation](https://altair-viz.github.io/): Official docs with tutorials
- [Altair Example Gallery](https://altair-viz.github.io/gallery/index.html): Hundreds of examples to copy
- [Vega-Lite](https://vega.github.io/vega-lite/): The underlying grammar Altair uses
- [Vega Color Schemes](https://vega.github.io/vega/docs/schemes/): All available color palettes

## Take-home project 1

Now apply what you've learned: [Take-home project 1](../Projects/Project%201.ipynb)