# Anscombe's Quartet
Using data visualization to understand your data

<a href="https://en.wikipedia.org/wiki/Anscombe%27s_quartet">Anscombe's Quartet</a> is a great example of why it is important to fully understand the variability in a data set. The goal of this exercise is to teach you never to trust summary statistics alone, but to always visualize your data.


Imagine you had four datasets of a few points each, where each point consists of two values, X and Y, and you want to characterise your data using some statistical tool. Let's start with some imports (which you are familiar with by now).

In [44]:
import pandas as pd
import altair as alt
from vega_datasets import data

Now, we will load the data as a dataframe.

In [45]:
anscombe = data.anscombe()


Let's view the data. There are four data sets (I to IV), each with different values of X and Y.

In [46]:
anscombe

,Series,X,Y
0,I,10,8.04
1,I,8,6.95
2,I,13,7.58
3,I,9,8.81
4,I,11,8.33
5,I,14,9.96
6,I,6,7.24
7,I,4,4.26
8,I,12,10.84
9,I,7,4.81


How about taking a quick look at the summary statistics of the data? Keep an eye out for anything striking in the summary.

In [47]:
anscombe.groupby('Series').describe()

Series,X count,X mean,X std,X min,X 25%,X 50%,X 75%,X max,Y count,Y mean,Y std,Y min,Y 25%,Y 50%,Y 75%,Y max
I,11.0,9.0,3.316625,4.0,6.5,9.0,11.5,14.0,11.0,7.5,2.03289,4.26,6.315,7.58,8.57,10.84
II,11.0,9.0,3.316625,4.0,6.5,9.0,11.5,14.0,11.0,7.500909,2.031657,3.1,6.695,8.14,8.95,9.26
III,11.0,9.0,3.316625,4.0,6.5,9.0,11.5,14.0,11.0,7.5,2.030424,5.39,6.25,7.11,7.98,12.74
IV,11.0,9.0,3.316625,8.0,8.0,8.0,8.0,19.0,11.0,7.500909,2.030579,5.25,6.17,7.04,8.19,12.5


We notice that the summary statistics are, if not identical, very similar (mean: X = 9 and Y ≈ 7.5; standard deviation: X ≈ 3.32 and Y ≈ 2.03). Since the four series share these statistical properties, they should look the same when visualized, right? However, that's not the case...
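The similarity goes beyond means and standard deviations. As a rough check, the sketch below computes the X/Y correlation and a least-squares line for each series. To stay self-contained it types the quartet in by hand instead of using the `anscombe` dataframe (the series I rows follow the table shown earlier; the remaining values are Anscombe's published ones), and `fit_summary` is a helper name made up for this illustration.

```python
import pandas as pd

# Anscombe's quartet typed in by hand (assumption: these match the loaded data).
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.81, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}
frames = [
    pd.DataFrame({"Series": name, "X": xs, "Y": ys})
    for name, (xs, ys) in quartet.items()
]
quartet_df = pd.concat(frames, ignore_index=True)


def fit_summary(group):
    """Correlation plus least-squares slope and intercept for one series."""
    slope = group["X"].cov(group["Y"]) / group["X"].var()
    intercept = group["Y"].mean() - slope * group["X"].mean()
    return pd.Series({
        "corr": group["X"].corr(group["Y"]),
        "slope": slope,
        "intercept": intercept,
    })


fit = quartet_df.groupby("Series")[["X", "Y"]].apply(fit_summary).round(2)
print(fit)
```

For all four series the correlation comes out near 0.82 and the fitted line near y = 3 + 0.5x, so even a regression summary cannot tell them apart.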


Plotting the data in two dimensions will show otherwise... Let's visualize the data!

In [60]:
def create_anscombe(data):
    anscombe_chart = alt.Chart(data).mark_circle().encode(
        alt.X('X'),
        alt.Y('Y'),
        alt.Facet('Series'),
        #tooltip= ['X', 'Y']
    ).properties(
        width=150,
        height=150
    )
    
    return anscombe_chart

In [61]:
create_anscombe(anscombe)

Voilà! Plotting the data reveals that the data sets are drastically different in appearance.


Anscombe's quartet is often used as an example to show that a statistical summary of a data set will naturally lose information, and so should be accompanied by further study, such as visualizing the data.

# Exercise

1. Try adjusting the scales of the X and Y axes so that they no longer have to include zero
2. Try adding data types for the two axes
3. Also make the chart interactive and add a tooltip

Data visualization methods help us to better understand patterns within the data. We have seen that when data are represented as an effectively designed visualization, the human visual system can easily extract patterns from them. However, some visual tasks (e.g. finding a particular group of points in a scatterplot) require additional operations.


Here we are going to look at two visual tasks for finding patterns in a dataset.

## Filtration Task

In [51]:
anscombe_ID = list(anscombe['Series'].unique()) #list the series column so you understand what to plot

In [66]:
anscombe_ID

['I', 'II', 'III', 'IV']

In [54]:
# Aggregate the data: pivot_table uses the mean by default, giving the mean X and Y per Series
anscombe_pivot = pd.pivot_table(anscombe, values=['X', 'Y'], index=['Series'])

In [55]:
anscombe_pivot 

Series,X,Y
I,9,7.5
II,9,7.500909
III,9,7.5
IV,9,7.500909
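Because `pd.pivot_table` aggregates with the mean unless told otherwise, the table above is simply the per-series mean of X and Y. A minimal sketch of that equivalence, using a small made-up frame (`toy`, not the Anscombe values) so it runs on its own:

```python
import pandas as pd

# Hypothetical toy data, chosen so the means are easy to check by eye.
toy = pd.DataFrame({
    "Series": ["A", "A", "B", "B"],
    "X": [1.0, 3.0, 10.0, 20.0],
    "Y": [2.0, 4.0, 5.0, 7.0],
})

# pivot_table with its default aggregation (the mean)
pivot_means = pd.pivot_table(toy, values=["X", "Y"], index=["Series"])

# The same aggregation written as an explicit groupby
groupby_means = toy.groupby("Series")[["X", "Y"]].mean()

print(pivot_means)
print(groupby_means)
```

Both printouts should show the same numbers; passing a different `aggfunc` (for example `'median'`) to `pivot_table` would change the aggregation.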


In [62]:
#Plot the data as a scatterplot
def anscombe_filter(data):
    anscombe_chart = alt.Chart(data).mark_circle().encode(
        alt.X('X'),
        alt.Y('Y'),
        color = 'Series',
        #tooltip= ['X', 'Y']
    ).interactive()
    
    return anscombe_chart

In [63]:
anscombe_filter(anscombe)

# Exercise

Perform the filtering task:

1. Make the chart interactive (implement panning, zooming, and a tooltip)
2. Implement filtering for the chart so that when you hover over a Series (for example, IV), only that series' pattern is shown

## Selection Task

In [89]:
def anscombe_select(data):
    
    #create the dropdown menu
    input_dropDown = alt.binding_select(options = anscombe_ID)
    
    selectDropdown = alt.selection_single(
        name= 'Select',
        fields= ['Series'],
        bind=input_dropDown
    )
    
    
    #conditions for selection
    opacity = alt.condition(selectDropdown, alt.value(1.00), alt.value(0.20))
    
    anscombe_chart = alt.Chart(data).mark_circle().encode(
        x='X',
        y='Y',
        tooltip= ['X', 'Y'],
        opacity= opacity
    ).add_selection(
        selectDropdown
    )
    
    return anscombe_chart

In [90]:
anscombe_select(anscombe)

# Exercise


1. Change the encoding: instead of opacity, use color, and show the non-selected points in gray

# Homework

Use everything you learned in this notebook to explore and analyze the DatasaurusDozen.tsv data set. Use pandas' read_csv function, with appropriate parameters, to import the data.
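As a hint, a `.tsv` file separates its fields with tabs, which `read_csv` needs to be told about. The sketch below reads a few made-up rows from an in-memory string; the column names `dataset`, `x` and `y` are an assumption about the real file, so check them after loading.

```python
import io

import pandas as pd

# Hypothetical rows standing in for the real file; values are made up.
tsv_text = (
    "dataset\tx\ty\n"
    "alpha\t1.0\t2.0\n"
    "alpha\t3.0\t4.0\n"
    "beta\t5.0\t6.0\n"
    "beta\t7.0\t8.0\n"
)

# sep="\t" tells read_csv that the fields are tab-separated
datasaurus = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Same first step as in this notebook: summary statistics per group
print(datasaurus.groupby("dataset").describe())
```

With the real file, the call would presumably be `pd.read_csv("DatasaurusDozen.tsv", sep="\t")`, with the path adjusted to wherever the file is stored.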

Have fun!
