# SI649 W23 Altair Homework #1

## Overview 

For this assignment we're going to recreate a visualization from a FiveThirtyEight article (https://fivethirtyeight.com/features/competitive-hot-dog-eaters-have-made-america-great-again/) and create some new ones. We'll be teaching you different pieces of Altair over the next few weeks, so we'll focus on just a few basic chart types this time:

1.   Replicate 1 visualization from the original article (slightly modified)
2.   Implement 4 new visualizations according to our specifications


### Lab Instructions (read the full version on the handout of the previous lab)

*   Save, rename, and submit the ipynb file (use your username in the name).
*   Complete all the checkpoints, creating the required visualization in each cell.
*   Run every cell (do Runtime -> Restart and run all to make sure you have a clean working version), print to PDF, and submit the PDF file.
*   For each visualization, we will ask you to write down a "Grammar of Graphics" plan first (basically a description of what you'll code).
*   If you end up stuck, show us your work by including links (URLs) that you have searched for. You'll get partial credit for showing your work in progress. 

You may also want to, on your own, go through some additional Altair tutorials:
- [UW Course](https://github.com/uwdata/visualization-curriculum)
- [Altair tutorial](https://github.com/altair-viz/altair-tutorial)

### Resources
- [Altair Documentation](https://altair-viz.github.io/index.html)
- [Colab Overview](https://colab.research.google.com/notebooks/basic_features_overview.ipynb)
- [Markdown Cheatsheet](https://www.markdownguide.org/cheat-sheet/)
- [Pandas DataFrame Introduction](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html)
- Vega-Lite documentation
- Vega/Vega-Lite editor


In [61]:
# imports we will use
import altair as alt
import pandas as pd
import datetime as dt
from altair_saver import save
#from collections import defaultdict
alt.renderers.enable('html') #run this line if you are running jupyter notebook

RendererRegistry.enable('html')

In [62]:
# Load the data we'll need (available on Canvas)
df = pd.read_csv('hotdogs_clean.csv', header=0, index_col=0)
print(df.shape)
df.sample(5)

(2315, 7)


Unnamed: 0,Place,Consumed,Name,Contest,Location,Date,Minutes
36,2nd,28.75,Derek Hendrickson,Nathan's Famous Hot Dog Eating Contest Qualifi...,"Pleasanton, CA",2022-06-18,10.0
1817,10th,9.5,"Mike ""Diskoe"" Iskoe",Nathan's Famous Hot Dog Eating Contest qualifier,"West Chester, PA",2007-06-28,12.0
652,1st,29,"Juan ""more bite"" Rodriguez",Nathan's Famous Hot Dog Eating Contest Qualifi...,"St. Louis, MO",2016-04-16,10.0
1939,5th,14.75,"Don ""Moses"" Lerman",Nathan's Famous Hot Dog Eating Contest qualifier,"Brooklyn, NY",2006-07-03,12.0
1757,?,?,"Steve ""The Nos"" Spinosa",Nathan's Famous Hot Dog Eating Contest qualifier,"Philadelphia, PA",2008-05-24,10.0


In [63]:
# Drop rows with question marks (?) for Place or Consumed
df = df[(df['Place'] != '?') & (df['Consumed'] != '?')]
df.shape

(2037, 7)
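An alternative to comparing against `'?'` is to let pandas coerce anything non-numeric to `NaN` and then drop those rows. The following is a minimal sketch on a toy frame; the rows are made up for illustration and are not taken from `hotdogs_clean.csv`.

```python
import pandas as pd

# Toy data (hypothetical rows) mimicking the '?' placeholders in the real file
toy = pd.DataFrame({
    'Place': ['1st', '?', '2nd', '3rd'],
    'Consumed': ['29', '?', '14.75', '?'],
})

# Coerce Consumed to numbers; anything unparseable (such as '?') becomes NaN
toy['Consumed'] = pd.to_numeric(toy['Consumed'], errors='coerce')

# Keep rows with a known Place and a numeric Consumed value
clean = toy[(toy['Place'] != '?') & toy['Consumed'].notna()]
print(clean)
```

With this approach the `Consumed` column is already numeric after filtering, so a separate `astype('float')` step would not be needed.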

In [64]:
# Check the data types
df.dtypes

Place        object
Consumed     object
Name         object
Contest      object
Location     object
Date         object
Minutes     float64
dtype: object

In [65]:
# Now that we've dropped the question marks, convert the Consumed and Date columns to the right data types
df['Consumed'] = df['Consumed'].astype('float')
df['Date'] = pd.to_datetime(df['Date'])
df.dtypes

Place               object
Consumed           float64
Name                object
Contest             object
Location            object
Date        datetime64[ns]
Minutes            float64
dtype: object

In [66]:
df.sample(5)

Unnamed: 0,Place,Consumed,Name,Contest,Location,Date,Minutes
1971,2nd,17.0,Eddie Hardy,Nathan's Famous Hot Dog Eating Contest qualifier,"Queens, NY",2006-06-18,12.0
567,6th,16.75,Nela Louise Zisser,Nathan's Famous Hot Dog Eating Contest - women,"Brooklyn, NY",2016-07-04,10.0
1389,1st,25.0,Crazy Legs Conti,Nathan's Famous Hot Dog Eating Contest qualifier,"Escondido, CA",2010-06-26,10.0
1011,10th,11.0,Kevin Sekulic,Nathan's Famous Hot Dog Eating Contest Qualifi...,"East Rutherford, NJ",2013-06-29,10.0
780,9th,6.75,Frank Herrero,Nathan's Famous Hot Dog Eating Contest Qualifi...,"Long Pond, PA",2015-06-07,10.0


In [67]:
# Finally, drop the data from 2017 onwards, and the very short and long contests
# This more closely matches the data used by FiveThirtyEight, although there are still some minor differences
df = df[(df['Date'] < dt.datetime(2017, 1, 1)) & (df['Minutes'] >= 5) & (df['Minutes'] < 60)]
df.shape

(1501, 7)

# Task #1

### First, let's examine the distribution of contests by length (in minutes). Recreate the visualization below

![](https://raw.githubusercontent.com/dallascard/SI649_public/main/altair_hw1/task1-final.png)

In [68]:
# Enable the FiveThirtyEight theme
#alt.theme.enable('fivethirtyeight') 

### Step 1: Write down your plan for the visualization (edit this cell)

*   mark type: bar
*   Encoding Specification:  
*   > x : minutes
*   > y : count()

Example encoding, if we had the nominal variable 'Location' and we wanted to use color, it would be:

color : Location : nominal


### Step 2: Create your chart, step by step

For each task, look at all the checkpoints. You can follow the checkpoints to work through the problem step by step. For each checkpoint, add code to the cell below it to create the required visualization. You can search for the keyword "TODO" to locate cells that need your edits.


#### checkpoint 1: basic histogram chart. You will get full points if you:


*  Plot the right data
*  Specify the correct mark
*  Use the correct x and y encoding

Your chart should look like: (it's okay if the grid lines don't exactly match)


![](https://raw.githubusercontent.com/dallascard/SI649_public/main/altair_hw1/task1-checkpoint1.png)

In [69]:
alt.Chart(df).mark_bar(color='#30a2da').encode(
    x=alt.X('Minutes', axis=alt.Axis(title='Minutes', tickCount=15)),  
    y=alt.Y('count()', axis=alt.Axis(title='Count of Records', tickCount=6))
).properties(
    width=410,
    height=350
)
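One way to sanity-check the bar heights is to compute the same counts in pandas: the `count()` aggregate groups records by the x value, which corresponds to `value_counts()` on that column. A minimal sketch on hypothetical toy values:

```python
import pandas as pd

# Hypothetical contest lengths, for illustration only
toy = pd.DataFrame({'Minutes': [10.0, 10.0, 12.0, 10.0, 12.0, 5.0]})

# Number of records per distinct contest length, ordered by length
counts = toy['Minutes'].value_counts().sort_index()
print(counts)
```

Running `df['Minutes'].value_counts().sort_index()` on the filtered data frame should give the numbers that the chart displays.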



#### checkpoint 2: basic bar chart with title and axis labels. You will get full points if you:

*  Completed checkpoint 1
*  Add the proper labels on the x-axis and y-axis
*  Add a chart title

Your chart should look like: (it's okay if the grid lines don't exactly match)


![](https://raw.githubusercontent.com/dallascard/SI649_public/main/altair_hw1/task1-final.png)

In [70]:
alt.Chart(df).mark_bar(color='#30a2da').encode(
    x=alt.X('Minutes', axis=alt.Axis(title='Contest Length (minutes)', tickCount=15)),  
    y=alt.Y('count()', axis=alt.Axis(title='Number of Contests', tickCount=6))
).properties(
    width=410,
    height=350,
    title = 'Distribution of contests by length'
)

# Task #2

### Now, let's recreate a visualization from the FiveThirtyEight article

### Here is the original:

![orig](https://fivethirtyeight.com/wp-content/uploads/2016/07/hickey-nathans-1.png?w=1150)


## We'll learn how to get closer to the original next week (using layering), but for now, we'll make a slightly modified version.

### Here is what you should aim to create:


![](https://raw.githubusercontent.com/dallascard/SI649_public/main/altair_hw1/task2-final-resized.png)

## Step 1: Write down your plan for the visualization (edit this cell)

*   mark type: point, filled
*   Encoding Specification:  
*   > x : date: O
*   > y : consumed: Q
*   > color : qualifier: N
*   > opacity : winner: N


## Step 2: Transform the relevant data using pandas

### We need to identify which records were for qualifiers, and which were winners



In [71]:
# First, let's identify the winners
# Note: there are many ways to do this; this is not compact, but it is fairly easy to understand

winners = []
for val in df['Place']:
    if val == '1st':
        winners.append('Winner')
    else:
        winners.append('Loser')
df['Winners'] = winners
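As the note above says, there are many ways to do this. One more compact option, sketched here on a hypothetical toy frame, is `numpy.where`, which picks between two labels based on a boolean condition:

```python
import numpy as np
import pandas as pd

# Hypothetical placements, for illustration only
toy = pd.DataFrame({'Place': ['1st', '2nd', '10th', '1st']})

# 'Winner' where Place is '1st', 'Loser' everywhere else
toy['Winners'] = np.where(toy['Place'] == '1st', 'Winner', 'Loser')
print(toy)
```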


In [72]:
# Repeat the above process to identify the records that correspond to qualifiers, and add a column to the dataframe

qualifiers = []
for val in df['Contest']:
    if val.lower().find('qualifier') > -1:
        qualifiers.append('Qualifier')
    else:
        qualifiers.append('Finals')
df['Qualifiers'] = qualifiers
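The same case-insensitive substring test can also be expressed with the pandas string accessor. This is a sketch under the assumption that every `Contest` value is a string; the third contest name below is made up to show that upper case is matched too.

```python
import pandas as pd

toy = pd.DataFrame({'Contest': [
    "Nathan's Famous Hot Dog Eating Contest qualifier",
    "Nathan's Famous Hot Dog Eating Contest",
    "Example Contest QUALIFIER",  # hypothetical name, for illustration
]})

# True where the contest name contains 'qualifier', ignoring case
is_qualifier = toy['Contest'].str.contains('qualifier', case=False)

# Translate the boolean mask into the two labels used in the charts
toy['Qualifiers'] = is_qualifier.map({True: 'Qualifier', False: 'Finals'})
print(toy)
```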


## Step 3: Create your chart, step by step


#### checkpoint 1: basic scatter plot of Year vs Number of Hot Dogs Consumed. You will get full points if you:

*  Plot the right data
*  Specify the correct mark
*  Use the correct x and y encoding (including converting dates to years)

Your chart should look like:

![](https://raw.githubusercontent.com/dallascard/SI649_public/main/altair_hw1/task2-checkpoint1.png)


In [73]:
alt.Chart(df).mark_point(filled=True).encode(
    x='Date',
    y='Consumed',
)


#### checkpoint 2: add color to the above scatterplot, corresponding to which records are qualifiers. You will get full points if you:

*  Completed checkpoint 1
*  Add a color channel to distinguish qualifiers from finals


Your chart should look like:

![](https://raw.githubusercontent.com/dallascard/SI649_public/main/altair_hw1/task2-checkpoint2.png)


In [74]:
alt.Chart(df).mark_point(filled=True).encode(
    x='Date',
    y='Consumed',
    color='Qualifiers',
)


#### checkpoint 3: add opacity values corresponding to 1st place vs other. You will get full points if you:


*  Completed checkpoint 2
*  Add an opacity channel to distinguish 1st place winners vs all other competitors


Your chart should look like:

![](https://raw.githubusercontent.com/dallascard/SI649_public/main/altair_hw1/task2-checkpoint3.png)


In [75]:
alt.Chart(df).mark_point(filled=True).encode(
    x='Date',
    y='Consumed',
    color='Qualifiers',
    opacity='Winners',
)


#### checkpoint 4: adjust the colors and opacity levels to match the plot specification. You will get full points if you:


*  Completed checkpoint 3
*  Change the colors to be red and orange
*  Change the opacity levels to be specific values

Your chart should look like:

![](https://raw.githubusercontent.com/dallascard/SI649_public/main/altair_hw1/task2-checkpoint4.png)


In [76]:
alt.Chart(df).mark_point(filled=True).encode(
    x='Date',
    y='Consumed',
    color=alt.Color('Qualifiers', scale=alt.Scale(domain=[ 'Finals','Qualifier'], range=['red', 'orange'])),
    opacity=alt.Opacity('Winners', scale=alt.Scale(domain=['Loser','Winner'], range=[0.1, 0.4]))
)

# Hint: you can set the target colors and opacity levels using scale=alt.Scale() as an argument of alt.Color() or alt.Opacity()
# good colors for the plot are "red" and "orange"
# good opacity levels are 0.4 and 0.1


#### checkpoint 5: add labels and title; adjust plot size; increase point size. You will get full points if you:


*  Completed checkpoint 4
*  Increase the mark size to 100
*  Remove the x-axis label
*  Change the y-axis label
*  Add a chart title (and subtitle if you can)
*  Change the plot dimensions to 500 x 400

Your chart should look like:

![](https://raw.githubusercontent.com/dallascard/SI649_public/main/altair_hw1/task2-checkpoint5-resized.png)




In [77]:
alt.Chart(df).mark_point(filled=True, size=100).encode(
    x=alt.X('Date', axis=alt.Axis(title=None)),
    y=alt.Y('Consumed', axis=alt.Axis(title="Hot Dogs Consumed")),
    color=alt.Color('Qualifiers', scale=alt.Scale(domain=[ 'Finals','Qualifier'], range=['red', 'orange'])),
    opacity=alt.Opacity('Winners', scale=alt.Scale(domain=['Loser','Winner'], range=[0.1, 0.4]))
).properties(
    width=500,
    height=400,
    title={
        "text": '"Think we may wanna write this stuff down?"',
        "subtitle": "Available Nathan's Hot Dog Eating Contest data",
        "anchor": "start" 
    }
)




#### checkpoint 6 (BONUS): Move the legends to the top of the plot and lay them out horizontally with a larger font. You will get full points if you:


*  Completed checkpoint 5
*  Move the legends to the top of the plot
*  Lay them out horizontally
*  Increase the font size in the legend

Your chart should look like:

![](https://raw.githubusercontent.com/dallascard/SI649_public/main/altair_hw1/task2-final-resized.png)




In [78]:
#TODO: Replicate task 2, checkpoint 6


# Task #3

### Create another new plot, showing the maximum hot dogs consumed per year, in qualifiers and finals



Here is what you should aim to create:

![](https://raw.githubusercontent.com/dallascard/SI649_public/main/altair_hw1/task3-final.png)


## Step 1: Write down your plan for the visualization (edit this cell)

*   mark type: line with points
*   Encoding Specification:  
*   > x : year of date: O
*   > y : consumed: max: Q
*   > color : qualifier: N

## Step 2: Create your chart, step by step


#### checkpoint 1: plot the maximum per year, with a different color for qualifiers vs finals. You will get full points if you:


*  Plot the right data
*  Specify the correct mark
*  Use the correct x, y, and color encodings

Your chart should look like:

![](https://raw.githubusercontent.com/dallascard/SI649_public/main/altair_hw1/task3-checkpoint1.png)

In [79]:
alt.Chart(df).mark_line().encode(
    x=alt.X('year(Date)'),
    y=alt.Y('Consumed', aggregate='max'),
    color='Qualifiers',
)
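The `aggregate='max'` encoding groups records by the other encoded fields (here the year and the `Qualifiers` color) and keeps the largest `Consumed` in each group. A pandas sketch of the same computation, on hypothetical toy rows, can help verify the plotted values:

```python
import pandas as pd

# Hypothetical toy rows, for illustration only
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2015-07-04', '2015-06-07', '2015-06-20', '2016-07-04']),
    'Consumed': [62.0, 6.75, 30.0, 70.0],
    'Qualifiers': ['Finals', 'Qualifier', 'Qualifier', 'Finals'],
})

# Maximum Consumed per (year, contest type), mirroring year(Date) + max + color
yearly_max = (
    toy.assign(Year=toy['Date'].dt.year)
       .groupby(['Year', 'Qualifiers'], as_index=False)['Consumed']
       .max()
)
print(yearly_max)
```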



#### checkpoint 2: change the colors to match the target plot, add points, and clean up labels and title. You will get full points if you:


*  Completed checkpoint 1
*  Change the colors to match Task 2
*  Add points to the line plot
*  Change the x-axis and y-axis labels to match the specification
*  Add a plot title

Your chart should look like:

![](https://raw.githubusercontent.com/dallascard/SI649_public/main/altair_hw1/task3-final.png)



In [80]:
alt.Chart(df).mark_line(point=True).encode(
    x=alt.X('year(Date)', axis=alt.Axis(title='Year')),
    y=alt.Y('Consumed', aggregate='max', axis=alt.Axis(title='Maximum Hot Dogs Consumed')),
    color=alt.Color('Qualifiers', scale=alt.Scale(domain=[ 'Finals','Qualifier'], range=['red', 'orange'])),
).properties(
    width=450,
    height=350,
    title=alt.TitleParams(
        text = "Maximum hot dogs consumed per year",
        anchor = "start",
        fontSize = 20
    )
)

# hint, you can use mark_line(point=True) to add points to a line plot


# Task #4

### Create a pair of plots, showing the Winners of Finals of the 10 minute competitions



Here is what you should aim to create:

![](https://raw.githubusercontent.com/dallascard/SI649_public/main/altair_hw1/task4-final-fix2.png)


## Step 1: Write down your plan for the visualization (edit this cell)

Left chart:
*   mark type: point(filled)
*   Encoding Specification:  
*   > x: consumed: Q
*   > y: Name: N: sort by avg(consumed)


Right chart:
*   mark type: bar
*   Encoding Specification:  
*   > x: count(): Q
*   > y: Name: N: sort by avg(consumed)

Compound Method (how to join these charts together?): (A|B), share Y axis

## Step 2: select the relevant data using pandas

### Select the set of rows where:
- Place = 1st
- The competition is NOT a qualifier
- The Duration is 10 minutes


In [81]:
df

Unnamed: 0,Place,Consumed,Name,Contest,Location,Date,Minutes,Winners,Qualifiers
540,1st,27.5,"Brian ""Dud Light"" Dudzinski",Nathan's Famous Hot Dog Eating Contest Qualifi...,"Nashville, TN",2016-10-11,10.0,Winner,Qualifier
541,2nd,25.5,"Eric ""Badlands"" Booker",Nathan's Famous Hot Dog Eating Contest Qualifi...,"Nashville, TN",2016-10-11,10.0,Loser,Qualifier
542,3rd,18.0,"""Nasty"" Nathan Biller",Nathan's Famous Hot Dog Eating Contest Qualifi...,"Nashville, TN",2016-10-11,10.0,Loser,Qualifier
543,1st,9.5,Liz McClurg,Nathan's Famous Hot Dog Eating Contest Qualifi...,"Nashville, TN",2016-10-11,10.0,Winner,Qualifier
544,2nd,8.5,Taylor Coombs,Nathan's Famous Hot Dog Eating Contest Qualifi...,"Nashville, TN",2016-10-11,10.0,Loser,Qualifier
...,...,...,...,...,...,...,...,...,...
2305,1st,10.0,Manel Hollenback,Nathan's Famous Hot Dog Eating Contest,"Brooklyn, NY",1978-05-29,6.5,Winner,Finals
2306,1st,10.0,Kevin Sinclair,Nathan's Famous Hot Dog Eating Contest,"Brooklyn, NY",1978-05-29,6.5,Winner,Finals
2307,1st,19.0,Jay Tierney,Nathan's Famous Hot Dog Eating Contest,"Atlantic City, NJ",1974-07-31,5.0,Winner,Finals
2311,1st,12.0,Melody Andorfer,Nathan's Famous Hot Dog Eating Contest,"Brooklyn, NY",1972-09-02,5.0,Winner,Finals


In [82]:
data = df[(df['Place'] == '1st') & (df['Minutes'] == 10) & (df['Qualifiers'] == 'Finals')]
data.shape
# Hint, refer back to where we excluded the "?" values above, for a hint on how to do this
# Hint, this should give you 26 rows, when selected from the filtered data frame above
# Note: there is some ambiguity in what counts as a qualifier. To get 26 rows, exclude every row where Contest contains the string "qualifier" (upper or lower case)
# Alternatively, if you get a slightly different number (e.g., 27) that will also be accepted, but will produce slightly different plots


(26, 9)

## Step 3: Create your chart


#### checkpoint 1: plot the number of hot dogs consumed by each competitor in the 10 minute finals they have won. You will get full points if you:

*  Plot the right data
*  Specify the correct mark
*  Use the correct x and y encodings

Your chart should look like:

![](https://raw.githubusercontent.com/dallascard/SI649_public/main/altair_hw1/task4-checkpoint1.png)

In [83]:
alt.Chart(data).mark_point(filled=True).encode(
    x='Consumed',
    y=alt.Y('Name', axis=alt.Axis(grid=True)),
)



#### checkpoint 2: sort the names by average number of hot dogs consumed in those competitions. You will get full points if you:

*  Completed checkpoint 1
*  Sort the names in the proper order

Your chart should look like:

![](https://raw.githubusercontent.com/dallascard/SI649_public/main/altair_hw1/task4-checkpoint2.png)



In [84]:
alt.Chart(data).mark_point(filled=True).encode(
    x=alt.X('Consumed'),
    y=alt.Y('Name',
            sort=alt.EncodingSortField(
                field="Consumed",
                op='average',
                order="descending"
            ),
            axis=alt.Axis(grid=True)
        )
)
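To check that the names appear in the intended order, the same ranking can be computed in pandas: average `Consumed` per name, sorted from highest to lowest. A minimal sketch with hypothetical names and values:

```python
import pandas as pd

# Hypothetical winners and amounts, for illustration only
toy = pd.DataFrame({
    'Name': ['A', 'B', 'A', 'C'],
    'Consumed': [60.0, 45.0, 70.0, 50.0],
})

# Names ordered by their average Consumed, largest first
order = (
    toy.groupby('Name')['Consumed']
       .mean()
       .sort_values(ascending=False)
       .index
       .tolist()
)
print(order)
```

A list like this can also be passed directly as `sort=` in `alt.Y(...)` when an explicit order is preferred over a sort field.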



#### checkpoint 3: fix the axis labels, and set the chart width to 200. You will get full points if you:

*  Completed checkpoint 2
*  Fix the x-axis labels
*  Remove the y-axis label
*  Set the plot width to be 200

Your chart should look like:

![](https://raw.githubusercontent.com/dallascard/SI649_public/main/altair_hw1/task4-checkpoint3.png)



In [85]:
lefthalf=alt.Chart(data).mark_point(filled=True).encode(
    x=alt.X('Consumed',
            axis=alt.Axis(title='Hot Dogs Consumed in Winning Finals')
        ),
    y=alt.Y('Name',
            sort=alt.EncodingSortField(
                field="Consumed",
                op='average',
                order="descending"
            ),
            axis=alt.Axis(title=None, grid=True)
        )
).properties(
    width=200,
)
lefthalf


#### checkpoint 4: plot the number of 10 minute finals competitions won by each competitor as a new chart. You will get full points if you:

*  Plot the right data
*  Specify the correct mark
*  Use the correct x and y encodings

Your chart should look like:

![](https://raw.githubusercontent.com/dallascard/SI649_public/main/altair_hw1/task4-checkpoint4.png)


 

In [86]:
alt.Chart(data).mark_bar().encode(
    x=alt.X('count()'),
    y=alt.Y('Name')
)

#### checkpoint 5: apply the same sort order as above, and fix the axis labels and width. You will get full points if you:

*  Completed checkpoint 4
*  Applied the correct sort order
*  Fix the x-axis labels
*  Remove the y-axis label
*  Set the plot width to be 200

Your chart should look like:

![](https://raw.githubusercontent.com/dallascard/SI649_public/main/altair_hw1/task4-checkpoint5.png)




In [87]:
righthalf=alt.Chart(data).mark_bar(color='#30a2da').encode(
    x=alt.X('count()', axis=alt.Axis(title='Number of Finals Won')),
    y=alt.Y('Name',
            sort=alt.EncodingSortField(
                field="Consumed",
                op='average',
                order="descending"
                ),
            axis=alt.Axis(title=None)
            )
).properties(
    width=200,
)
righthalf

#### checkpoint 6: place the two charts side by side. You will get full points if you:

*  Completed checkpoints 3 and 5
*  Placed the two charts side by side
*  Add a title to each chart

Your chart should look like:
![](https://raw.githubusercontent.com/dallascard/SI649_public/main/altair_hw1/task4-final-fix2.png)


 

In [88]:
lefthalf = lefthalf.properties(
        title=alt.TitleParams(
            text = "Individual Results",
            anchor = "start",
            fontSize = 20
            )
        )
righthalf = righthalf.properties(
        title=alt.TitleParams(
            text = "Total Wins",
            anchor = "start",
            fontSize = 20
            )
        )
(lefthalf|righthalf)


*End of Assignment*

Please run all cells (Runtime->Run all), and 
1.  save to PDF 
    * We suggest using your browser's print feature: File->Print->Save PDF. You can also try the notebook's File->Download As->PDF, but we've noticed this doesn't work as well. If you're a Windows user and need help, take a look [here](https://www.digitaltrends.com/computing/print-pdf-windows/)
2.  save to ipynb (File -> Download .ipynb)

Rename both files with your uniqname (e.g., uniqname.pdf / uniqname.ipynb) and upload both files to Canvas.
