We will visualize a variety of datasets from the [vega-datasets](https://github.com/vega/vega-datasets) collection:

- A dataset of `cars` from the 1970s and early 1980s,
- A dataset of `movies`, previously used in the [Data Transformation](https://github.com/uwdata/visualization-curriculum/blob/master/altair_data_transformation.ipynb) notebook,
- A dataset of technology company `stocks`, and
- A dataset of US state statistics (`stateinfo`), joined to US state shapes (`usshapes`) for maps.

In [142]:
import altair as alt
import pandas as pd
cars = 'https://cdn.jsdelivr.net/npm/vega-datasets@1/data/cars.json'
movies = 'https://cdn.jsdelivr.net/npm/vega-datasets@1/data/movies.json'
stocks = 'https://cdn.jsdelivr.net/npm/vega-datasets@1/data/stocks.csv'
stateinfo = 'https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/population_engineers_hurricanes.csv'
usshapes = 'https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/us-10m.json'
states = alt.topo_feature(usshapes, 'states')

Some resources that might be of use for the worksheet today include:
- Altair/Vega [color scheme options](https://vega.github.io/vega/docs/schemes/)
- [CSS color names list](https://www.w3.org/wiki/CSS/Properties/color/keywords)
- [ColorBrewer](https://colorbrewer2.org/#type=sequential&scheme=BuGn&n=3) color palette explorer
- [Colorblindly](https://chrome.google.com/webstore/detail/colorblindly/floniaahmccleoclneebhhmnjgdfijgg) Google Chrome extension
- [Colorvision](https://addons.mozilla.org/en-US/firefox/addon/colorvision/) Firefox extension

## Instructions

There are 9 plots given below, each with some color, scale, axes, etc. misused in some fashion. For each plot,
1. Describe the flaw(s). Be thorough with your answers, as there are multiple issues in some plots. Be sure to justify your claim in terms of what we have learned about perception, expressiveness, effectiveness, etc. 

2. Make the changes and recreate the plot. Place your updated version, titled "Updated", to the right of the original plot so they can be viewed simultaneously. You may change the encodings as appropriate, but do not change the variables used; the updated plot **must** maintain the original purpose.

## Categorical Colors

### Plot 1

Flaws: The x-axis uses a logarithmic scale for `IMDB_Votes`, but the axis title does not say so, which can mislead readers about the spacing of values. Thousands of points overlap, especially in the popular vote ranges, so high-density regions cannot be distinguished. This violates effectiveness, because the audience cannot accurately perceive the distribution or clustering. Lowering the mark opacity makes the overlapping points, and therefore the density, visible.

In [117]:
original = alt.Chart(movies).mark_circle().encode(
    alt.X('IMDB_Votes:Q').scale(type='log'),
    alt.Y('IMDB_Rating:Q'),
    alt.Color('Major_Genre:N').scale(scheme='tableau20')
).properties(title='Original')

updated = alt.Chart(movies).mark_circle(size=40, opacity=0.4).encode(
    alt.X('IMDB_Votes:Q',
          scale=alt.Scale(type='log'),
          axis=alt.Axis(tickCount=10),
          title='IMDB Votes (log10 scale)'),
    # Keep the rating axis linear, as in the original plot
    alt.Y('IMDB_Rating:Q',
          axis=alt.Axis(tickCount=10),
          title='IMDB Rating'),
    alt.Color('Major_Genre:N',
          scale=alt.Scale(scheme='tableau20'))
).properties(title='Updated', width=300, height=300)

original  | updated
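
The effect of the lower opacity can be sketched with a little arithmetic. This is a minimal illustration, assuming the renderer uses standard source-over compositing; `stacked_opacity` is a hypothetical helper, not part of Altair. It shows how the combined opacity grows with the number of overlapping points, which is what makes dense regions look darker than sparse ones.

```python
def stacked_opacity(alpha: float, n: int) -> float:
    """Combined opacity of n identical marks drawn on top of each other."""
    # Each layer lets a fraction (1 - alpha) of what is behind it show through,
    # so n layers let (1 - alpha) ** n through; the rest is covered.
    return 1 - (1 - alpha) ** n


# opacity=0.4 is the value used in the updated plot above
for n in (1, 2, 3, 5, 10):
    print(n, round(stacked_opacity(0.4, n), 3))
```

With `opacity=0.4`, a single point stays fairly light, while a handful of overlapping points is already close to fully opaque, so very dense clusters may still saturate.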

### Plot 2

Flaws: `Cylinders` is not a continuous variable. It takes only a few distinct values (3, 4, 5, 6, 8), each denoting an engine type, so it is effectively categorical. The original plot nevertheless encodes it as quantitative (`:Q`), and the resulting continuous color gradient suggests a smooth numeric progression. This violates expressiveness, because the visual encoding implies relationships that do not exist. Moreover, two of the lines have the same color, which makes them hard to tell apart.
The default x-axis title, "Year (year)", is also inappropriate; "Year" is enough.

In [168]:
original = alt.Chart(cars).mark_line().encode(
    alt.X('year(Year):T'),
    alt.Y('mean(Acceleration):Q'),
    alt.Color('Cylinders:Q').scale(scheme='rainbow')
).properties(title='Original')
updated = alt.Chart(cars).mark_line().encode(
    alt.X('year(Year):T', title='Year'),
    alt.Y('mean(Acceleration):Q', axis=alt.Axis(tickCount=10), scale=alt.Scale(domain=[10, 21])),
    alt.Color('Cylinders:N').scale(scheme='rainbow')
).properties(title='Updated')
original | updated

In [146]:
c=pd.read_json(cars)
c['Cylinders'].unique()

array([8, 4, 6, 3, 5])

### Plot 3

Flaws:
- The chosen color range (`['#C5C5C5','#378FA3','#C60000','#FEC600','#007AC5']`) mixes gray, red, blue, and yellow in no perceptual order. Since `engineers` is a quantitative variable, viewers expect higher values to correspond to a darker or more intense shade of one hue. Here, the colors do not follow a monotonic lightness pattern, which violates expressiveness.
- The white borders (`stroke='white'`) disappear against some of the map fills, especially the light gray regions. Small East Coast states and islands become hard to distinguish from the white background.
- A multi-hue palette like this one reads as categorical, so color differences imply categorical distinctions rather than ordered magnitudes. A single-hue, light-to-dark sequential scheme (such as `blues`) suits a binned quantitative variable and makes the plot more effective.


In [360]:
original = alt.Chart(states).mark_geoshape(stroke='white').encode(
    alt.Color('engineers:Q').bin().scale(domain=[0,0.002,0.004,0.006,0.008,0.01], 
                                  range=['#C5C5C5','#378FA3','#C60000','#FEC600','#007AC5'])
).transform_lookup(
    lookup='id',
    from_=alt.LookupData(stateinfo, 'id', ['engineers'])
).properties(
    width=500,
    height=300,
    title='Original'
).project(
    type='albersUsa'
)
updated = alt.Chart(states).mark_geoshape(
    stroke='black',
    strokeWidth=0.5
).encode(
    alt.Color('engineers:Q',
              bin=alt.Bin(step=0.002))
        .scale(scheme='blues')
        .legend(title='Engineers per capita'),
).transform_lookup(
    lookup='id',
    from_=alt.LookupData(stateinfo, 'id', ['engineers'])
).properties(
    width=500,
    height=300,
    title='Updated'
).project(
    type='albersUsa'
)
(original | updated).resolve_scale(color='independent')
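
The claim about non-monotonic lightness can be checked numerically. The sketch below uses WCAG relative luminance as a rough proxy for perceived lightness; the helper names and the three-color `blue_ramp` are illustrative assumptions and are not the exact colors of Vega's `blues` scheme.

```python
def relative_luminance(hex_color: str) -> float:
    """WCAG relative luminance of an sRGB color such as '#C5C5C5'."""
    hex_color = hex_color.lstrip('#')
    channels = [int(hex_color[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    # Undo the sRGB gamma encoding before weighting the channels
    linear = [c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
              for c in channels]
    r, g, b = linear
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def is_monotonic(values) -> bool:
    """True if the values never change direction (all rising or all falling)."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


original_range = ['#C5C5C5', '#378FA3', '#C60000', '#FEC600', '#007AC5']
blue_ramp = ['#deebf7', '#9ecae1', '#3182bd']  # assumed light-to-dark blues

for name, palette in [('original', original_range), ('blue ramp', blue_ramp)]:
    lums = [relative_luminance(c) for c in palette]
    print(name, [round(v, 2) for v in lums], 'monotonic:', is_monotonic(lums))
```

The original range moves from light to dark and back again, while the blue ramp darkens steadily, which is the ordering a viewer expects from a quantitative scale.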

### Plot 4

Flaws:
- One stock has a much higher price (up to 700) while the others remain below 150. Because all lines share one y-axis scale, the smaller series appear flat and unreadable. This violates effectiveness, since viewers cannot meaningfully compare trends across stocks.
- The y-axis is titled simply "price", without any units (USD).

In [220]:
original = alt.Chart(stocks).mark_line().encode(
    alt.X('date:T'),
    alt.Y('price:Q'),
    alt.Color('symbol:N')
).properties(width=600, title='Original')

stocks_updated = alt.Chart(stocks).mark_line().encode(
    alt.X('date:T')
        .axis(title='Date'),
    alt.Y('price:Q')
        .axis(title='Price (USD)'),
    alt.Color('symbol:N')
).properties(
    width=250,
    height=120
).facet(
    row='symbol:N',
    title='Updated'
).resolve_scale(
    y='independent'
)
(original | stocks_updated).resolve_scale(color='independent')
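
A rough calculation shows why the smaller series look flat on a shared axis. It uses the approximate prices quoted in the flaw description (700 and 150) and assumes the axis starts at zero; `axis_fraction` is a hypothetical helper for illustration only.

```python
def axis_fraction(value: float, axis_min: float, axis_max: float) -> float:
    """Fraction of the axis height at which a value is drawn."""
    return (value - axis_min) / (axis_max - axis_min)


# Shared scale: the axis must reach the highest-priced stock (about 700)
shared = axis_fraction(150, 0, 700)
# Independent scale: each facet's axis only needs to reach its own maximum
independent = axis_fraction(150, 0, 150)

print(f"shared axis: {shared:.0%} of the height, independent axis: {independent:.0%}")
```

The trade-off is that independent axes make it easy to read each stock's trend but no longer allow absolute prices to be compared across facets by position alone.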

### Plot 5

Flaws:
- The bars are faceted into rows by `MPAA_Rating` and also colored by `MPAA_Rating`, so the color is redundant.
- Within each MPAA rating row, the real task is to compare genres, for example, "In PG-13 films, which genre is rated highest on average?" However, the current color does not distinguish genres: all bars in a row share the same color because they share the same MPAA rating. A better choice is to use color to encode `Major_Genre` instead of `MPAA_Rating`, so that each bar's color directly identifies its genre.

In [354]:
original = alt.Chart(movies).transform_filter(
    'datum.IMDB_Rating != null'
).transform_filter(
    alt.FieldOneOfPredicate(field='MPAA_Rating', oneOf=['G', 'PG', 'PG-13','R'])
).transform_filter(
    alt.FieldOneOfPredicate(field='Major_Genre', oneOf=['Action', 'Comedy', 'Drama','Horror'])
).mark_bar().encode(
    alt.Y('Major_Genre:N'),
    alt.X('mean(IMDB_Rating):Q'),
    alt.Row('MPAA_Rating:N'),
    alt.Color('MPAA_Rating:N')
).properties(
    title='Original'
)
updated = alt.Chart(movies).transform_filter(
    'datum.IMDB_Rating != null'
).transform_filter(
    alt.FieldOneOfPredicate(field='MPAA_Rating', oneOf=['G', 'PG', 'PG-13', 'R'])
).transform_filter(
    alt.FieldOneOfPredicate(field='Major_Genre', oneOf=['Action', 'Comedy', 'Drama', 'Horror'])
).mark_bar().encode(
    alt.Y('Major_Genre:N')
        .sort('-x')
        .axis(title='Major Genre'),

    alt.X('mean(IMDB_Rating):Q')
        .axis(title='Mean IMDB Rating'),
    alt.Row('MPAA_Rating:N')
        .sort(['G','PG','PG-13','R'])
        .title('MPAA Rating'),
    alt.Color('Major_Genre:N')
        .scale(scheme='tableau10')
        .legend(title='Major Genre')
).properties(
    title='Updated'
)
(original | updated).resolve_scale(color='independent')

### Plot 6

Flaws:
- Every bar gets a unique color mapped to state. State is already on the y-axis, so color adds no new information. Instead, it creates a huge categorical legend that is unreadable and visually noisy. This violates effectiveness, because color should be used only when it adds meaningful differentiation.
- It is hard to find the states with the highest or lowest population. Sorting the bars by population makes it easier to compare relative population across states.

In [238]:
original = alt.Chart(stateinfo).mark_bar().encode(
    alt.Y('state:N'),
    alt.X('population:Q'),
    alt.Color('state:O').scale(scheme='plasma')
).properties(
    title='Original'
)
updated = alt.Chart(stateinfo).mark_bar().encode(
    alt.Y('state:N')
        .sort('-x')
        .axis(title='State'),
    alt.X('population:Q')
        .axis(title='Population')
).properties(
    title='Updated',
    width=400,
    height=600
)
(original | updated).resolve_scale(color='independent')

## Quantitative Colors

### Plot 7

Flaws:
- The bars are sorted with `.sort(field='Acceleration')`, but bar length encodes weight. The plot is not effective, because the ordering follows one metric while the lengths show another, so the bars appear in no discernible order and the cars are hard to rank by weight.

In [277]:
original = alt.Chart(cars).mark_bar().transform_filter('datum.Cylinders==8 & year(datum.Year)== 1975').encode(
    alt.X('Weight_in_lbs:Q').title('Weight (lbs)'),
    alt.Y('Name:N').sort(field='Acceleration'),
    alt.Color('Acceleration:Q')
).properties(
    title='Original'
)
cars_updated = alt.Chart(cars).transform_filter(
    'datum.Cylinders==8 && year(datum.Year)==1975'
).mark_bar(stroke='black', strokeWidth=0.2).encode(
    alt.X('Weight_in_lbs:Q')
        .axis(title='Weight (lbs)'),
    alt.Y('Name:N')
        .sort('-x')
        .axis(title='Car'),
    alt.Color('Acceleration:Q')
        .scale(scheme='blues', reverse=True)
        .legend(title='Acceleration')
).properties(
    title='Updated',
    width=350,
    height=200
)

(original | cars_updated).resolve_scale(color='independent')

### Plot 8

Flaws:
- The color scale maps low MPG values to bright red and high MPG values to green, passing through a very light yellow in the middle. This draws attention to the extremes rather than to the overall pattern. In addition, a red-to-green ramp is not perceptually uniform and can be confusing. Because higher fuel efficiency (MPG) is a positive attribute, a single-hue green scheme, in which darker shades mark higher values, is far more intuitive.

In [294]:
original = alt.Chart(cars).transform_joinaggregate(
    groupby=['Name'],
    yearcount = 'count()'
).transform_filter(
    'datum.yearcount > 2'
).mark_bar().encode(
    alt.X('year(Year):T').title('Year'),
    alt.Y('Name:N'),
    alt.Color('max(Miles_per_Gallon):Q').scale(domain=[10,25,40], range=['#FB261B','#F8F1CE','#1D7903'])
).properties(
    title='Original'
)
updated = alt.Chart(cars).transform_joinaggregate(
    groupby=['Name'],
    yearcount='count()',
    max_mpg_overall='max(Miles_per_Gallon)'
).transform_filter(
    'datum.yearcount > 2'
).mark_bar().encode(
    alt.X('year(Year):T')
        .axis(title='Year', format='%Y'),
    alt.Y('Name:N')
        .axis(title='Car Model'),
    alt.Color('max(Miles_per_Gallon):Q')
        .scale(scheme='greens')
        .legend(title='Max of Miles_per_Gallon')
).properties(
    title='Updated',
    width=400,
    height=450
)
(original | updated).resolve_scale(color='independent')

### Plot 9

Flaws: Jitter prevents points from piling on top of each other, which is good, but it also spreads the points vertically, so the cars no longer sit on a clean, single line per origin. To address this, I add a stroke color that reinforces `Origin`, so the regions remain distinguishable even with jitter.


In [346]:
original = alt.Chart(cars).mark_circle().encode(
    alt.X('Horsepower:Q'),
    alt.Y('Origin:N'),
    alt.Color('Miles_per_Gallon:Q'),
    yOffset = 'jitter:Q'
).transform_calculate(
    # Generate Gaussian jitter with a Box-Muller transform
    jitter="sqrt(-2*log(random()))*cos(2*PI*random())"
).properties(
    height=300,
    title='Original'
)
updated = alt.Chart(cars).transform_calculate(
    jitter="sqrt(-2*log(random()))*cos(2*PI*random())"
).mark_circle(
    opacity=0.6,
    strokeWidth=0.7
).encode(
    alt.X('Horsepower:Q')
        .axis(title='Horsepower (HP)'),
    alt.Y('Origin:N')
        .axis(title='Origin'),
    alt.YOffset('jitter:Q'),

    alt.Color('Miles_per_Gallon:Q')
        .scale(scheme='blues')  # darker = higher MPG
        .legend(title='MPG (higher = more efficient)'),

    # Give Origin a visual identity using stroke color
    alt.Stroke('Origin:N')
        .scale(scheme='tableau10')
        .legend(title='Origin')
).properties(
    width=350,
    height=300,
    title='Updated'
)
(original | updated).resolve_scale(color='independent')
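
The jitter expression in both charts is a Box-Muller transform, which turns two uniform random numbers into one sample from a standard normal distribution. The sketch below mirrors that expression in plain Python as a sanity check; `gaussian_jitter` is a hypothetical name, and the `1.0 - rng.random()` guard against `log(0)` is an addition that the Vega expression does not have.

```python
import math
import random
import statistics


def gaussian_jitter(rng: random.Random) -> float:
    """One standard-normal sample via the Box-Muller transform."""
    u1 = 1.0 - rng.random()  # in (0, 1], so log(u1) is always defined
    u2 = rng.random()
    return math.sqrt(-2 * math.log(u1)) * math.cos(2 * math.pi * u2)


rng = random.Random(0)  # fixed seed so the check is repeatable
samples = [gaussian_jitter(rng) for _ in range(20_000)]

# A standard normal has mean 0 and standard deviation 1
print(round(statistics.mean(samples), 2), round(statistics.pstdev(samples), 2))
```

Because the samples are unbounded, a few points can land far from their row's center, which is one reason the stroke color helps tie each point back to its `Origin`.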